Re: How to implement AbstractRecordWriter

Nicolas A Perez Fri, 31 May 2019 12:16:35 -0700

Is there a chance we can get on a WebEx call at some point so someone can
help me out with an initial test run?


On Fri, May 31, 2019 at 19:38 Paul Rogers <[email protected]> wrote:

> Hi Nicolas,
>
> Regarding your point that plugins should be, well, plugins -- independent
> of Drill code. Yes, that is true. But, no one has invested the time to make
> it so. Doing so would require a clear, stable code API; an easy way to
> develop such code without the need for the "build jar, copy to DRILL_HOME,
> restart Drill" approach that Charles mentioned.
>
> There were some recent improvements around the bootstrap file, which is
> great. In the meantime, and since the MapR plugin code is already part of
> Drill, let's see if we can get the "work within Drill" approach to work for
> you. Then, perhaps you can use your experience to suggest changes that
> could be made to achieve the "true plugin" goal. All the Drill contributors
> who are not part of the core Drill team would likely very much appreciate a
> true plugin capability.
>
>
> I use Eclipse, perhaps others who use IntelliJ can comment on the
> specifics of that IDE.
>
> Drill is divided into modules: your code in the contrib module depends on
> Drill code in java-exec, vector and so on. When I run tests in java-exec in
> Eclipse, Eclipse automatically detects and rebuilds changes in dependent
> modules such as common or vector. This establishes that Eclipse, at least,
> understands Maven dependencies.
>
>
> I seem to recall that I also got this to work when writing the Drill book
> when I created an example plugin in the contrib module. I don't recall
> having to change anything to get it to work. Perhaps others who have worked
> on other contrib modules can offer their experience.
>
>
> So, one thing to check is if the Maven dependencies are configured
> correctly for the MapR plugin.
>
> One issue which I thought we solved is test-time dependencies. Tim did
> some work to ensure that code in src/test is visible to downstream modules.
> Which symbols/constructs are causing you problems? Perhaps there is more to
> fix?
>
> For now, perhaps you can target the goal of getting the existing MapR
> plugin code to work properly in the IDE. This is supposed to work, so it
> might just be a matter of resolving a few specific glitches.
>
> Has anyone worked on the MapR DB plugin previously and can offer advice?
>
> Thanks,
> - Paul
>
>
>
>     On Friday, May 31, 2019, 10:10:14 AM PDT, Nicolas A Perez <
> [email protected]> wrote:
>
>  One of the issues I have is that I haven’t found a way to debug my tests
> from IntelliJ. It continues to say that some constructs from other modules
> are missing.
>
> Also, I haven’t found *simple* examples of how to write *simple* tests.
> Every time I look at the existing code, the tests are done in a different
> way.
>
> Now, on the other hand, plugins should be independent of the Drill core
> modules. If you think about it, I can easily write a library that can be
> injected into Spark without touching Spark code. For instance, the
> DataSource API will load the required parts from my code at run time. Drill
> does the same, but the problem is the coupling between Drill and its
> extension points.
>
> On the test side, there is another problem: you cannot easily test your
> new modules unless they are within the Drill core code. Maybe it is time to
> decouple the test framework from Drill itself, too.
>
> On Fri, May 31, 2019 at 18:38 Paul Rogers <[email protected]>
> wrote:
>
> > Hi Nicolas,
> >
> > Charles outlined the choices quite well.
> >
> > Let's talk about your observation that you find it annoying to deal with
> > the full Drill code. There may be some tricks here that can help you.
> >
> > As you know, I've been revising the text reader and the "EVF" (row set
> > framework). Doing so requires a series of pull requests. To move fast, I've
> > found the following workflow to be helpful:
> >
> > * Use a machine with an SSD. A Mac is ideal. A Linux desktop also works
> > (mine uses Linux Mint.) The SSD makes rebuilds very fast.
> >
> > * Use unit tests for all your testing. For example, I created dozens of
> > unit tests for CSV files to exercise the text reader, and many more to
> > exercise the EVF. All development and testing consists of adding/changing
> > code, adding/changing tests, and stepping through the unit test and
> > underlying code to find bugs.
> >
> > * Use JUnit categories to run selected unit tests as a group.
> >
> > In most cases, you let your IDE do the build; you don't need Maven, nor do
> > you need to build jar files. Edit a file, run a unit test from your IDE, and
> > step through the code. My edit/compile/debug cycle tends to be seconds.
> >
> > If, however, you find yourself using Maven to build Drill, then
> > running unit tests from Maven and attaching a debugger, your
> > edit/compile/debug cycle will be 5+ minutes, which is going to be
> > irritating.
> >
> > If you are doing a full build so you can use SqlLine to test, then this
> > suggests it is time to write a unit test case for that issue so you can run
> > it from the IDE. Using the RowSet stuff makes such tests easy. See
> > TestCsvWithHeaders [1] for some examples.
> >
> > If you run from the IDE and find things don't work, then perhaps there is
> > a config issue. Do we have code that looks for a file in
> > $DRILL_HOME/whatever rather than using the class path? Is a required native
> > library not on the LD_LIBRARY_PATH for the IDE?
> >
> > Most unit tests are designed to be stateless. They read a file stored in
> > resources, or they write a test file, read the file, and discard the file
> > when done.
> >
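The "write a test file, read the file, and discard the file" pattern can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name TempCsvRoundTrip is hypothetical, and the simple comma split stands in for whatever reader is actually under test; it is not Drill's test framework.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: write a temp file, read it back, always delete it. */
final class TempCsvRoundTrip {
  /** Path of the most recent temp file, kept only so a caller can check cleanup. */
  static Path lastFile;

  static List<String[]> roundTrip(List<String> lines) {
    Path file = null;
    try {
      file = Files.createTempFile("csv-round-trip", ".csv");
      lastFile = file;
      Files.write(file, lines, StandardCharsets.UTF_8);
      List<String[]> rows = new ArrayList<>();
      // The naive split stands in for the reader being tested.
      for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        rows.add(line.split(","));
      }
      return rows;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      if (file != null) {
        file.toFile().delete(); // discard the file so the next run starts clean
      }
    }
  }

  public static void main(String[] args) {
    for (String[] row : roundTrip(Arrays.asList("id,name", "1,drill"))) {
      System.out.println(String.join("|", row));
    }
  }
}
```

Because the file is removed in the finally block, a failed assertion in one run leaves nothing behind for the next run.
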
> > You are using MapRDB to insert data, which, of course, is stateful. So,
> > perhaps your test can put the DB into a known start state, insert some
> > records, read those records, compare them with the expected results, and
> > clean up the state so you are ready for the next test run. Your target is
> > that edit/compile/debug cycle of a few seconds.
> >
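A minimal sketch of that stateful test cycle (known start state, insert, read, compare, clean up). FakeTable is a hypothetical in-memory stand-in for the real store, and none of these names come from Drill or MapR-DB; a real test would replace it with the actual table client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a stateful store such as a DB table. */
final class FakeTable {
  private final Map<String, String> rows = new TreeMap<>();

  void truncate() { rows.clear(); }
  void put(String key, String value) { rows.put(key, value); }
  int size() { return rows.size(); }

  /** Returns a copy of the rows in key order, formatted as key=value. */
  List<String> scan() {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, String> e : rows.entrySet()) {
      out.add(e.getKey() + "=" + e.getValue());
    }
    return out;
  }
}

final class StatefulRoundTrip {
  /** Resets the table, inserts the input, reads it back, and cleans up. */
  static List<String> run(FakeTable table, Map<String, String> input) {
    table.truncate(); // put the store into a known start state
    try {
      for (Map.Entry<String, String> e : input.entrySet()) {
        table.put(e.getKey(), e.getValue());
      }
      return table.scan(); // the caller compares this with expected results
    } finally {
      table.truncate(); // leave the store clean for the next test run
    }
  }

  public static void main(String[] args) {
    Map<String, String> input = new TreeMap<>();
    input.put("b", "2");
    input.put("a", "1");
    System.out.println(run(new FakeTable(), input));
  }
}
```

Resetting the state both before and after the test means a crashed earlier run cannot poison the next one.
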
> >
> > Overall, if you can master the art of running Drill, using unit tests, in
> > your IDE, you can move forward very quickly.
> >
> > Use Maven builds, and run tests via Maven, only when getting ready to
> > submit a PR. If you change, say, only the contrib module, you only need to
> > build and test that module. If you also change exec, say, then you can just
> > build those two modules.
> >
> > To use categories, tag your tests as follows:
> >
> > @Category(RowSetTests.class) class MyTest ...
> >
> > (I'll send the Maven command line separately; I'm not on that machine at
> > the moment.)
> >
> >
> > Thanks much to the team members who helped make this happen. I've since
> > worked on other projects that don't have this power and it is truly a
> > grueling experience to wait for long builds and deploys after every change.
> >
> >
> > Thanks,
> > - Paul
> >
> > [1]
> > https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/store/easy/text/compliant/TestCsvWithHeaders.java
> >
> >
> >
> >
> >
> >    On Friday, May 31, 2019, 5:17:40 AM PDT, Charles Givre <
> > [email protected]> wrote:
> >
> >  Hi Nicolas,
> >
> > You have two options:
> > 1.  You can develop format plugins and UDFs in Drill by adding them to the
> > contrib/ folder and then test them with unit tests.  Take a look at this PR
> > as an example[1].  If you're intending to submit your work to Drill for
> > inclusion, this would be my recommendation as you can write the unit tests
> > as you go, and it doesn't take very long to build and you can debug.
> > 2.  Alternatively, you can package the code separately as shown here[2].
> > However, this option requires you to build it, then copy the jars over to
> > DRILL_HOME/jars/3rdparty along with any dependencies, then run Drill. I'm
> > not sure how you could write unit tests this way.
> >
> > I hope this helps.
> >
> >
> > [1]: https://github.com/apache/drill/pull/1749
> > [2]: https://github.com/cgivre/drill-excel-plugin
> >
> >
> > > On May 31, 2019, at 8:06 AM, Nicolas A Perez <[email protected]>
> > wrote:
> > >
> > > Paul,
> > >
> > > Is it possible to develop my plugin outside of the Drill code, let's say in
> > > my own repository, and then package it and add it to the location where the
> > > plugins live? Does that work, too? I just find it annoying to deal with the
> > > full Drill code in order to develop a plugin. At the same time, I might
> > > want to detach the development of plugins from the Drill life cycle itself.
> > >
> > > Please advise.
> > >
> > > Best Regards,
> > >
> > > Nicolas A Perez
> > >
> > > On Thu, May 30, 2019 at 9:58 PM Paul Rogers <[email protected]
> >
> > > wrote:
> > >
> > >> Hi Nicolas,
> > >>
> > >> A quick check of the code suggests that AbstractWriter is a
> > >> JSON-serialized description of the physical plan. It represents the
> > >> information sent from the planner to the execution engine, and is
> > >> interpreted by the scan operator. That is, it is the "physical plan."
> > >>
> > >> The question is, how does the execution engine create the actual
> > >> writer based on the physical plan? The only good example seems to be for
> > >> the FileSystemPlugin. That particular storage plugin is complicated by the
> > >> additional layer of the format plugins.
> > >>
> > >> There is a bit of magic here. Briefly, Drill uses a BatchCreator to create
> > >> your writer. It does so via some Java introspection magic. Drill looks for
> > >> all subclasses of BatchCreator, then uses the type of the second argument to
> > >> the getBatch() method to find the correct class. This may mean that you
> > >> need to create one with MapRDBFormatPluginConfig as the type of the second
> > >> argument.
> > >>
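The lookup described above can be illustrated with plain Java reflection. This is a hedged sketch, not Drill's code: Creator, CreatorRegistry, and the config classes are hypothetical stand-ins, and creators are registered by hand here, whereas Drill is described as discovering them itself.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for Drill's BatchCreator; NOT the real interface. */
interface Creator<T> {
  String getBatch(Object context, T config, List<Object> children);
}

/** Hypothetical config classes playing the role of physical-plan nodes. */
final class JsonWriterConfig {}
final class DbWriterConfig {}

final class JsonWriterCreator implements Creator<JsonWriterConfig> {
  @Override
  public String getBatch(Object context, JsonWriterConfig config, List<Object> children) {
    return "json writer batch";
  }
}

final class DbWriterCreator implements Creator<DbWriterConfig> {
  @Override
  public String getBatch(Object context, DbWriterConfig config, List<Object> children) {
    return "db writer batch";
  }
}

/** Indexes creators by the declared type of getBatch()'s second parameter. */
final class CreatorRegistry {
  private final Map<Class<?>, Creator<?>> byConfigType = new HashMap<>();

  void register(Creator<?> creator) {
    for (Method m : creator.getClass().getDeclaredMethods()) {
      // Generics make the compiler emit a bridge method whose second
      // parameter is Object; skip it and keep the declared signature.
      if (m.getName().equals("getBatch") && !m.isBridge() && m.getParameterCount() == 3) {
        byConfigType.put(m.getParameterTypes()[1], creator);
      }
    }
  }

  @SuppressWarnings("unchecked")
  <T> Creator<T> lookup(T config) {
    Creator<?> creator = byConfigType.get(config.getClass());
    if (creator == null) {
      throw new IllegalStateException("No creator for " + config.getClass().getName());
    }
    return (Creator<T>) creator;
  }

  public static void main(String[] args) {
    CreatorRegistry registry = new CreatorRegistry();
    registry.register(new JsonWriterCreator());
    registry.register(new DbWriterCreator());
    DbWriterConfig config = new DbWriterConfig();
    System.out.println(registry.lookup(config).getBatch(null, config, new ArrayList<>()));
  }
}
```

The point of the sketch is the matching rule: the creator is selected purely by the class of the config object, which is why the second parameter's declared type matters.
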
> > >> The getBatch() method then creates the CloseableRecordBatch
> > >> implementation. This is a full Drill operator, meaning it must handle the
> > >> Volcano iterator protocol. Looks like you can perhaps use WriterRecordBatch
> > >> as the writer operator itself. (See EasyWriterBatchCreator and follow the
> > >> code to understand the plumbing.)
> > >>
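For readers unfamiliar with the Volcano iterator protocol, here is a simplified pull-based sketch. Everything in it is an assumption made for illustration: Outcome, Operator, ListSource, and SketchWriterOperator are hypothetical names, the real protocol has more states, and the drain-then-summarize behavior is only a plausible shape for a writer operator, not a copy of WriterRecordBatch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Simplified outcome codes; an assumption, the real protocol has more states. */
enum Outcome { OK, NONE }

/** Hypothetical pull-based operator: the parent calls next() to get a batch. */
interface Operator {
  Outcome next();
  List<String> currentBatch();
  void close();
}

/** Leaf operator serving pre-built batches, standing in for the upstream plan. */
final class ListSource implements Operator {
  private final Iterator<List<String>> batches;
  private List<String> current = Collections.emptyList();
  boolean closed;

  ListSource(List<List<String>> batches) { this.batches = batches.iterator(); }

  @Override public Outcome next() {
    if (!batches.hasNext()) { return Outcome.NONE; }
    current = batches.next();
    return Outcome.OK;
  }
  @Override public List<String> currentBatch() { return current; }
  @Override public void close() { closed = true; }
}

/** Writer operator: drains upstream into a sink, then emits one summary batch. */
final class SketchWriterOperator implements Operator {
  private final Operator upstream;
  private final List<String> sink;
  private List<String> summary = Collections.emptyList();
  private boolean done;

  SketchWriterOperator(Operator upstream, List<String> sink) {
    this.upstream = upstream;
    this.sink = sink;
  }

  @Override public Outcome next() {
    if (done) { return Outcome.NONE; }
    int written = 0;
    while (upstream.next() == Outcome.OK) {
      for (String row : upstream.currentBatch()) {
        sink.add(row); // a real writer would hand the row to its record writer
        written++;
      }
    }
    done = true;
    summary = Collections.singletonList("rows written: " + written);
    return Outcome.OK;
  }
  @Override public List<String> currentBatch() { return summary; }
  @Override public void close() { upstream.close(); }

  public static void main(String[] args) {
    List<String> sink = new ArrayList<>();
    Operator writer = new SketchWriterOperator(
        new ListSource(Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"))), sink);
    while (writer.next() == Outcome.OK) {
      System.out.println(writer.currentBatch());
    }
    writer.close();
  }
}
```

Each operator only ever talks to its immediate child through next(), which is what lets a writer be dropped on top of any upstream plan.
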
> > >> You create a RecordWriter to do the actual work. AFAIK, MapRDB supports
> > >> a JSON data model (at least in some form). If this is the version you are
> > >> working on, the fastest development path might just be to copy the
> > >> JsonRecordWriter, and replace the writes to JSON with writes to MapRDB. At
> > >> least this gives you a place to start looking.
> > >>
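The "replace the writes to JSON with writes to the database" idea can be sketched as a writer that builds one document per record and hands it to a sink. All names here are hypothetical: SimpleRecordWriter is a cut-down contract invented for this example (the real RecordWriter interface has more methods), and DocumentSink stands in for whatever table client is actually used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified writer contract; NOT Drill's RecordWriter. */
interface SimpleRecordWriter {
  void startRecord();
  void writeField(String name, Object value);
  void endRecord();
  void cleanup();
}

/** Hypothetical sink standing in for a document-store table client. */
interface DocumentSink {
  void insert(Map<String, Object> document);
  void flush();
}

/** In-memory sink so the sketch runs without any database. */
final class InMemorySink implements DocumentSink {
  final List<Map<String, Object>> documents = new ArrayList<>();
  int flushes;

  @Override public void insert(Map<String, Object> document) { documents.add(document); }
  @Override public void flush() { flushes++; }
}

/** Builds one document per record and inserts it when the record ends. */
final class DocumentRecordWriter implements SimpleRecordWriter {
  private final DocumentSink sink;
  private Map<String, Object> current;

  DocumentRecordWriter(DocumentSink sink) { this.sink = sink; }

  @Override public void startRecord() { current = new LinkedHashMap<>(); }

  @Override public void writeField(String name, Object value) {
    if (current == null) {
      throw new IllegalStateException("writeField called outside a record");
    }
    current.put(name, value);
  }

  @Override public void endRecord() {
    if (current == null) {
      throw new IllegalStateException("endRecord called without startRecord");
    }
    sink.insert(current); // the JSON version would serialize the record here
    current = null;
  }

  @Override public void cleanup() { sink.flush(); }

  public static void main(String[] args) {
    InMemorySink sink = new InMemorySink();
    DocumentRecordWriter writer = new DocumentRecordWriter(sink);
    writer.startRecord();
    writer.writeField("id", 1);
    writer.writeField("name", "drill");
    writer.endRecord();
    writer.cleanup();
    System.out.println(sink.documents);
  }
}
```

Keeping the sink behind an interface also makes the writer testable from the IDE with an in-memory fake before any real database is involved.
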
> > >>
> > >> A more general solution would be to build the writer using some of the
> > >> recent additions to Drill such as the row set mechanisms for reading a
> > >> record batch. But, since copying the JSON approach provides a quick & dirty
> > >> solution, perhaps that is good enough for this particular use case.
> > >>
> > >>
> > >> In our book, we recommend building each step one-by-one and doing a quick
> > >> test to verify that each step works as you expect. If you create your
> > >> BatchCreator, but not the writer, things won't actually work, but you can
> > >> set a breakpoint in the getBatch() method to verify that Drill did find your
> > >> class. And so on.
> > >>
> > >>
> > >> Thanks,
> > >> - Paul
> > >>
> > >>
> > >>
> > >>    On Thursday, May 30, 2019, 3:05:39 AM PDT, Nicolas A Perez <
> > >> [email protected]> wrote:
> > >>
> > >> Can anyone give me an overview of how to implement AbstractRecordWriter?
> > >>
> > >> What are the mechanics it follows, what should I do, and so on? It would be
> > >> very helpful.
> > >>
> > >> Best Regards,
> > >>
> > >> Nicolas A Perez
> > >> --
> > >>
> > >>
> >
> --------------------------------------------------------------------------------------------
> > >> Sent by Nicolas A Perez from my GMAIL account.
> > >>
> > >>
> >
> --------------------------------------------------------------------------------------------
> > >>
> > >
> > >
> > >
> > > --
> > >
> >
> --------------------------------------------------------------------------------------------
> > > Sent by Nicolas A Perez from my GMAIL account.
> > >
> >
> --------------------------------------------------------------------------------------------
> >
>
> --
> Nicolas A Perez from GMAIL MOBILE

-- 
Nicolas A Perez from GMAIL MOBILE
