Sorry for the long delay (was on vacation). Using org.apache.hadoop isn't part of the Apache Beam Core module but is a dependency for those who depend on the Apache Beam Hadoop module. So I don't think swapping the com.facebook.presto.hadoop version for the org.apache.hadoop version will address Ismael's concern about including hadoop dependencies as part of the core.
I looked at shevek/lzo-java[1] and I think its our best choice since it is: * pure Java * GPLv3 (would require marking the dependency optional and telling users to add it explicitly which we have done in the past as well) * small (<100kb) * small dependency tree (commons-logging & findbugs annotations if we only depend on lzo-core) * performance (github page claims 500mb/s compression and 800mb/s decompression) Alternatively we can make the LZO compression an extension module (with the facebook dependency) and use a registrar to have it loaded dynamically. 1: https://github.com/shevek/lzo-java On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari <[email protected]> wrote: > While studying the code, we found that the airlift/ aircompressor library > only requires some classes which are also present in apache hadoop common > package. Therefore, we are now thinking that if we make changes in the > airlift/ aircompressor package, remove the > com.facebook.presto.hadoop and use the existing org.apache.hadoop > <https://mvnrepository.com/artifact/org.apache.hadoop> package which is > already included in beam. This will solve both #2 and #3 as the transitive > dependency will be removed and the size will also be reduced by almost > ~20mbs. > > But if we use this approach, we will have to manually change the util > whenever any changes are made to the airlift library. > > On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <[email protected]> wrote: > >> Going with the Registrar/ServiceLoader route would allow for alternative >> providers for the same compression algorithms so if they don't like one >> they can always contribute a different one. >> >> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <[email protected]> wrote: >> >>> (1) seems not to be the issue because it is Apache licensed. >>> (2) and (3) are the big issues, because it requires a provided huge uber >>> jar that essentially leaks Hadoop classes into core SDK [1] so it is >>> definitely concerning. >>> >>> We discussed at some point during the PR that added ZStandard support >>> about creating some sort of Registrar for compression algorithms [2] but we >>> decided to not go ahead because we could achieve that for the zstd case via >>> the optional dependencies of commons-compress. Maybe it is time to >>> reconsider if such mechanism is worth. For example for users that may not >>> care about having the hadoop leakage to be able to use LZO. >>> >>> Refs. >>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16 >>> [2] https://issues.apache.org/jira/browse/BEAM-6422 >>> >>> >>> >>> >>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >>>> Is there a way to wrap this up as an optional dependency with multiple >>>> possible providers, if there's no good library satisfying all of the >>>> conditions (in particular (1))? >>>> >>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <[email protected]> wrote: >>>> > >>>> > I was hoping that someone in the community would provide some >>>> alternatives since there are quite a few implementations. >>>> > >>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <[email protected]> >>>> wrote: >>>> >> >>>> >> Hi Luke, >>>> >> >>>> >> I agree with your thoughts and observations. But, >>>> airlift:aircompressor is the only implementation of LZO in pure java. That >>>> straight away solves #5. >>>> >> The other implementations that I found either have licensing issues >>>> (since LZO natively uses GNU GPL licence) or are implemented using .c, .h >>>> and jni (which again make them dependent on the OS). Please refer these: >>>> twitter/hadoop-lzo and shevek/lzo-java. >>>> >> These were the main reasons why we based this on >>>> airlift:aircompressor. >>>> >> >>>> >> Thanks and Regards, >>>> >> Amogh >>>> >> >>>> >> >>>> >> >>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <[email protected]> wrote: >>>> >>> >>>> >>> I took a look. My biggest concern is finding a good LZO >>>> implementation. Looking for one that preferably has: >>>> >>> 1) Apache license >>>> >>> 2) Has zero transitive dependencies >>>> >>> 3) Is small >>>> >>> 4) Is performant >>>> >>> 5) Is native java or supports execution on the three main OSs >>>> (Windows, Linux, Mac) >>>> >>> >>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16 which >>>> doesn't meet item #2 and its transitive dependency fails #3. >>>> >>> >>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <[email protected]> >>>> wrote: >>>> >>>> >>>> >>>> Hi, >>>> >>>> I have filed a PR for an extension that will enable Apache Beam to >>>> work with LZO/LZOP compression. Please refer. >>>> >>>> I would love it if someone can take this up and review it. >>>> >>>> Please feel free to share your thoughts/suggestions. >>>> >>>> Regards, >>>> >>>> Amogh >>>> >>>
