Hey Paul,
Feedback is definitely welcome on this... My thought was to create an easy-to-use API for relatively simple formats. By relatively simple I mean stuff like HTTPD, Syslog, LTSV, etc.: formats without heavily nested data. My real goal here was to abstract all the routine stuff that is common to all those file types, so that all a new format plugin has to do is implement an iterator. If a plugin needs to play with the Drill internals, it can override the EasyEVFReader methods. LTSV doesn't have a fixed or known schema, so I haven't dealt with that yet, but my plan is to add a few stub methods that plugins can override to define a schema.
With that said, I am really out of date and rusty with my Generics, Abstract classes and the like so I'll need some pointers there.
Thanks in advance!
-- C

> On Jan 26, 2020, at 9:20 PM, Paul Rogers <[email protected]> wrote:
>
> Hi Charles,
>
> Better APIs are always a good thing!
>
> The EVF ManagedReader interface has the minimum common denominator API: open,
> next (batch) and close.
>
> We can create extensions that provide more structure such as with your
> EasyEVFReader. For example: open() might: 1) fiddle with the DrillFileSystem
> to open the file and seek to the block start location, 2) do something with
> schema, 3) set up required shims. next() could do your steps 2-6: For example:
>
> next() {
>   setupBatch();
>   while (!loader.isFull()) {
>     if (!readRow()) { break; }
>     loadRow();
>   }
>   finalizeBatch();
> }
>
> Many readers now use reader-specific "shims" to map from input columns/types
> to EVF column writers. In this case, the above loadRow() can be defined to
> iterate over the shims.
>
> The trick has always been that each reader is a bit different: different
> objects are used, slightly different logic. It seemed simpler to let each
> reader create its own code structure rather than create an elaborate general
> structure that folks must learn.
>
> I'm looking forward to see what you created.
>
> Thanks,
> - Paul
>
>
> On Sunday, January 26, 2020, 8:41:56 AM PST, Charles Givre
> <[email protected]> wrote:
>
> Hello all
> I wanted to share something that I’m working on and ask for feedback. I
> started working on converting the LTSV format plugin to EVF and basically was
> able to do that pretty quickly. This is a relatively simple format in that
> it has one data type and no complex fields.
>
> Instead of just doing the conversion I wanted to see if we could put some
> more abstraction on the format plugin architecture that would make it easier
> for people to build format plugins without having to learn the various Drill
> internals. I’m still working on the coding and will share once it is more
> presentable. Basically I realized that every format plugin is at a high level
> the same.
> It has to
> 1. Open the input source
> 2. Read that data in
> 3. Parse that data into rows
> 4. Parse the rows into fields
> 5. Map the fields into Drill structures
> 6. Stop when it runs out of data.
>
> Steps 1 and 2 are virtually identical for every format plugin and hence that
> was the low hanging fruit 🍎. Steps 3-5 sounded like an iterator to me and
> step 6 again was something that could be hidden.
>
> So what I did was write an abstract class called EasyEVFReader which
> abstracts virtually all of the file operations. It also includes utility
> functions for schema definition (more on that later) and column mapping.
> Basically all the developer has to do is
> 1. Create an iterator class that reads the data and maps it to the rows
> 2. Extend the EasyEVFReader class and assign the iterator to a variable.
>
> I’ll share the code tonight or tomorrow but I wanted to ask what people think
> about the general approach. My goal was to get rid of the cut/paste code
> that exists in so many plugins and greatly simplify the process.
> Thanks!
>
> Sent from my iPhone
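
For anyone following the thread, the approach discussed above (an abstract reader that owns the open/next loop, with each format plugin supplying only an iterator) might look roughly like the sketch below. This is only an illustration under stated assumptions: EasyReaderSketch, LtsvReaderSketch and createIterator are hypothetical names, none of this is Drill or EVF code, and plain Java collections stand in for the DrillFileSystem, the batch loader and the column writers. The loop in next() mirrors the shape of Paul's pseudocode.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed EasyEVFReader; not actual Drill code.
abstract class EasyReaderSketch {
  private final int batchSize;
  private Iterator<Map<String, String>> rows;

  EasyReaderSketch(int batchSize) {
    this.batchSize = batchSize;
  }

  // The only piece a format plugin supplies (steps 3-5 in the list above).
  protected abstract Iterator<Map<String, String>> createIterator(BufferedReader input);

  // Steps 1-2. A real reader would open the file through Drill's file system;
  // here the caller simply hands in an already opened reader.
  void open(BufferedReader input) {
    rows = createIterator(input);
  }

  // Fills one batch. Returns false once the input is exhausted (step 6).
  boolean next(List<Map<String, String>> batch) {
    batch.clear();
    while (batch.size() < batchSize && rows.hasNext()) {
      batch.add(rows.next());
    }
    return !batch.isEmpty();
  }
}

// Example plugin: LTSV lines are tab-separated "label:value" fields.
class LtsvReaderSketch extends EasyReaderSketch {
  LtsvReaderSketch(int batchSize) {
    super(batchSize);
  }

  @Override
  protected Iterator<Map<String, String>> createIterator(BufferedReader input) {
    Iterator<String> lines = input.lines().iterator();
    return new Iterator<Map<String, String>>() {
      @Override
      public boolean hasNext() {
        return lines.hasNext();
      }

      @Override
      public Map<String, String> next() {
        Map<String, String> row = new LinkedHashMap<>();
        for (String field : lines.next().split("\t")) {
          int colon = field.indexOf(':');
          if (colon > 0) { // skip fields without a label
            row.put(field.substring(0, colon), field.substring(colon + 1));
          }
        }
        return row;
      }
    };
  }
}

class EasyReaderDemo {
  public static void main(String[] args) {
    String data = "host:127.0.0.1\tstatus:200\nhost:10.0.0.1\tstatus:404\n";
    EasyReaderSketch reader = new LtsvReaderSketch(1);
    reader.open(new BufferedReader(new StringReader(data)));
    List<Map<String, String>> batch = new ArrayList<>();
    while (reader.next(batch)) {
      System.out.println(batch);
    }
  }
}
```

The design choice here is the template-method pattern: the base class fixes the batch loop once, and a subclass overrides only the hook it needs. A schema hook for formats like LTSV with no fixed schema could be added the same way, as another overridable method with a default implementation.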
