I'm leaning a bit towards 1), but I would love to get some input from the Avro
community, since 1) also depends on them: we would submit patches upstream
that need to be reviewed and, someday, released.

Are Avro committers subscribed here, or should we reach out to them on their ML?
Given that we are currently quite active in the C++ space, I feel that we can
contribute a fair amount of build and packaging infrastructure that we
maintain either way for Arrow. This might be quite helpful for the project. We
have seen this with Parquet, where much of the development happens simply
because it is part of Arrow. (I'm not suggesting we merge/fork the Avro
codebase, just that we apply some of the best practices we learned while
building Arrow.)

Uwe

On Tue, Mar 5, 2019, at 4:57 PM, Wes McKinney wrote:
> I'd be +0.5 in favor of forking in this particular case. Since Avro is
> not vectorized (unlike Parquet and ORC) I suspect it may be more
> difficult to get the best performance using a general purpose API
> versus one that is more specialized to producing Arrow record batches.
> Given that there has been relatively light C++ development activity in
> Apache Avro and no releases for 2 years, it does give me pause.
> 
> We might want to look at Impala's Avro scanner, they are doing some
> LLVM IR cross-compilation also (they're using the Avro C++ library
> though)
> 
> https://github.com/apache/impala/blob/master/be/src/exec/hdfs-avro-scanner-ir.cc
> https://github.com/apache/impala/blob/master/be/src/exec/hdfs-avro-scanner.cc
> 
> On Tue, Mar 5, 2019 at 1:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > I'm looking at incorporating Avro in Arrow C++ [1]. It seems that the Avro
> > C++ library APIs have improved since the last release. However, it is not
> > clear when a new release will be available (I asked on the JIRA item for
> > the next release [2] and received no response).
> >
> > I was wondering if there is a policy governing the use of other Apache
> > projects, or how people feel about the following options:
> > 1.  Depend on a specific git commit through the third-party library system.
> > 2.  Copy the necessary source code temporarily to our project, and change
> > to using the next release when it is available.
> > 3.  Fork the code we need (the main benefit I see here is being able to
> > refactor it to avoid having to deal with exceptions, easier integration
> > with our IO system and one less 3rd party dependency to deal with).
> > 4.  Wait on the 1.9 release before proceeding.
> >
> > Thanks,
> > Micah
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-1209
> > [2] https://issues.apache.org/jira/browse/AVRO-2250
>
