Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Wes McKinney Tue, 23 Jul 2019 07:43:09 -0700

Yes, I think text files are OK, but I want to make sure that committers are
reviewing patches for binary files because there have been a number of
incidents in the past where I had to roll back patches to remove such
files.


On Tue, Jul 23, 2019, 10:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Wes,
> I haven't checked locally, but that file, at least for me, renders as a text
> file in GitHub (with an Apache header).  If we want all test data in the
> testing package I can make sure to move it, but I thought text files might
> be ok in the main repo?
>
> Thanks,
> Micah
>
> On Tuesday, July 23, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> I noticed that test data-related files are beginning to be checked in
>>
>>
>> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc
>>
>> I wanted to make sure this doesn't turn into a slippery slope where we
>> end up with several megabytes or more of test data files
>>
>> On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >
>> > Hi Wes,
>> > Are there currently files that need to be moved?
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Monday, July 22, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> Sort of tangentially related, but while we are on the topic:
>> >>
>> >> Please, if you would, avoid checking binary test data files into the
>> >> main repository. Use https://github.com/apache/arrow-testing if you
>> >> truly need to check in binary data -- something to look out for in
>> >> code reviews
>> >>
>> >> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >
>> >> > Hi Jacques,
>> >> > Thanks for the clarifications. I think the distinction is useful.
>> >> >
>> >> > > If people want to write adapters for Arrow, I see that as useful but
>> >> > > very different than writing native implementations and we should try
>> >> > > to create a clear delineation between the two.
>> >> >
>> >> >
>> >> > What do you think about creating a "contrib" directory and moving the
>> >> > JDBC and AVRO adapters into it? We should also probably provide more
>> >> > description in pom.xml to make it clear for downstream consumers.
>> >> >
>> >> > We should probably come up with a name other than adapters for
>> >> > readers/writers ("converters"?) and use it in the directory structure
>> >> > for the existing Orc implementation?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> >
>> >> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> >> >
>> >> > > As I read through your responses, I think it might be useful to talk
>> >> > > about adapters versus native Arrow readers/writers. An adapter is
>> >> > > something that adapts an existing API to produce and/or consume Arrow
>> >> > > data. A native reader/writer is something that understands the format
>> >> > > directly and does not have intermediate representations or APIs the
>> >> > > data moves through beyond those that need to be used to complete work.
>> >> > >
>> >> > > If people want to write adapters for Arrow, I see that as useful but
>> >> > > very different than writing native implementations and we should try
>> >> > > to create a clear delineation between the two.
>> >> > >
>> >> > > Further comments inline.
>> >> > >
>> >> > >
>> >> > >> Could you expand on what level of detail you would like to see in a
>> >> > >> design document?
>> >> > >>
>> >> > >
>> >> > > A couple of paragraphs seems sufficient. These are the goals of the
>> >> > > implementation. We target existing functionality X. It is an adapter.
>> >> > > Or it is a native impl. These are the expected memory and processing
>> >> > > characteristics, etc.  I've never been one for huge amounts of design,
>> >> > > but I've seen a number of recent patches appear where there is no
>> >> > > upfront discussion. Making sure that multiple people buy into a design
>> >> > > is the best way to ensure long-term maintenance and use.
>> >> > >
>> >> > >
>> >> > >> I think this should be optional (the same argument below about
>> >> > >> predicates applies, so I won't repeat it).
>> >> > >>
>> >> > >
>> >> > > Per my comments above, maybe adapter versus native reader clarifies
>> >> > > things. For example, I've been working on a native Avro read
>> >> > > implementation. It is little more than chicken scratch at this point,
>> >> > > but its goals, vision and design are very different from those of the
>> >> > > adapter that is being produced atm.
>> >> > >
>> >> > >
>> >> > >> Can you clarify the intent of this objective?  Is it mainly to tie in
>> >> > >> with the existing Java Arrow memory bookkeeping?  Performance?
>> >> > >> Something else?
>> >> > >>
>> >> > >
>> >> > > Arrow is designed to be off-heap. If you have large variable amounts of
>> >> > > on-heap memory in an application, it starts to make it very hard to
>> >> > > make decisions about off-heap versus on-heap memory since those
>> >> > > divisions are by and large static in nature. It's fine for short-lived
>> >> > > applications but for long-lived applications, if you're working with a
>> >> > > large amount of data, you want to keep most of your memory in one pool.
>> >> > > In the context of Arrow, this is going to naturally be off-heap memory.
>> >> > >
>> >> > >
>> >> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
>> >> > >> situation.  Starting off with a known good implementation of
>> >> > >> conversion to Arrow can allow us both to profile hot-spots and to
>> >> > >> provide a comparison of implementations to verify correctness.
>> >> > >>
>> >> > >
>> >> > > I'm not clear what message we're sending as a community if we produce
>> >> > > low-performance components. The whole point of Arrow is to increase
>> >> > > performance, not decrease it. I'm targeting good, not perfect. At the
>> >> > > same time, from my perspective, Arrow development should not be
>> >> > > approached in the same way that general Java app development should
>> >> > > be. If we hold a high standard, we'll have fewer total integrations
>> >> > > initially but I think we'll solve more real world problems.
>> >> > >
>> >> > >> There is also the question of how widely adoptable we want Arrow
>> >> > >> libraries to be.
>> >> > >> It isn't surprising to me that Impala's Avro reader is an order of
>> >> > >> magnitude faster than the stock Java one.  As far as I know Impala's
>> >> > >> is a C++ implementation that does JIT with LLVM.  We could try to use
>> >> > >> it as a basis for converting to Arrow but I think this might limit
>> >> > >> adoption in some circumstances.  Some organizations/people might be
>> >> > >> hesitant to adopt the technology due to:
>> >> > >> 1.  Use of JNI.
>> >> > >> 2.  Use of LLVM to do JIT.
>> >> > >>
>> >> > >> It seems that as long as we have a reasonably general interface to
>> >> > >> data-sources we should be able to optimize/refactor aggressively when
>> >> > >> needed.
>> >> > >>
>> >> > >
>> >> > > This is somewhat the crux of the problem. It goes a little bit to who
>> >> > > our consuming audience is and what we're trying to deliver. I'll also
>> >> > > say that trying to build a high-quality implementation on top of a
>> >> > > low-quality implementation or library-based adapter is worse than
>> >> > > starting from scratch. I believe this is especially true in Java where
>> >> > > developers are trained to trust HotSpot and that things will be good
>> >> > > enough. That is great in a web app but not in systems software where
>> >> > > we (and I expect others) will deploy Arrow.
>> >> > >
>> >> > >
>> >> > >> >    3. Propose a generalized "reader" interface as opposed to making
>> >> > >> >    each reader have a different way to package/integrate.
>> >> > >>
>> >> > >> This also seems like a good idea.  Is this something you were thinking
>> >> > >> of doing or just a proposal that someone in the community should take
>> >> > >> up before we get too many more implementations?
>> >> > >>
>> >> > >
>> >> > > I don't have something in mind and didn't have a plan to build
>> >> > > something, just want to make sure we start getting consistent early as
>> >> > > opposed to once we have a bunch of readers/adapters.
>> >> > >
>>
>
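
For illustration only, here is a minimal sketch of what the generalized
"reader" interface and the adapter-versus-native distinction discussed in the
quoted thread might look like. Nothing here is an existing Arrow API:
SourceKind, BatchReader and IteratorAdapterReader are hypothetical names, and
a plain java.util.List stands in for an Arrow record batch so the example
stays self-contained.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical marker: does a source adapt another API or parse the format itself? */
enum SourceKind { ADAPTER, NATIVE }

/** Hypothetical common contract that every reader would implement. */
interface BatchReader<B> extends AutoCloseable {
    SourceKind kind();

    boolean hasNext();

    B next();

    @Override
    void close(); // narrowed so callers need not handle a checked exception
}

/**
 * An adapter in the thread's sense: it wraps an existing row-oriented API
 * (here a plain Iterator) and regroups rows into batches. The List batch is
 * an on-heap stand-in for an Arrow record batch.
 */
final class IteratorAdapterReader<T> implements BatchReader<List<T>> {
    private final Iterator<T> rows;
    private final int batchSize;

    IteratorAdapterReader(Iterator<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rows = rows;
        this.batchSize = batchSize;
    }

    @Override
    public SourceKind kind() {
        return SourceKind.ADAPTER;
    }

    @Override
    public boolean hasNext() {
        return rows.hasNext();
    }

    @Override
    public List<T> next() {
        if (!rows.hasNext()) {
            throw new NoSuchElementException("no more rows");
        }
        // Fill one batch, stopping early if the underlying source runs out.
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }

    @Override
    public void close() {
        // Nothing to release in this sketch; a real reader would free buffers here.
    }
}
```

A native reader would implement the same BatchReader contract but report
SourceKind.NATIVE and decode the file format directly into its batches, so
downstream code could consume either kind through one interface.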
