Apologies for a late reply. The PR has been updated now. Would love to get your thoughts/suggestions.
On Tue, Jan 7, 2020 at 12:36 AM Luke Cwik <[email protected]> wrote: > Have you had a chance to update the PR? > > On Mon, Dec 30, 2019 at 5:00 AM Amogh Tiwari <[email protected]> wrote: > >> Hi Luke, >> >> We have gone through shevek/lzo-java, but we chose to go with >> airflow/aircompressor for the following reasons: >> >> 1) shevek/lzo-java is internally using .jni, .c and .h files, hence the >> GNU licence, and that would leave us with only choice of putting this as an >> option dependency >> >> 2) performance of airlift/aircompressor was much better as compared to >> shevek/lzo-java in terms of compression ratio and time taken for >> compression/decompression >> >> 3) airflow/aircompressor is in pure java and is under Apache licence >> >> Therefore, we'd prefer to go with adding the current implementation as >> optional. We'd require your inputs on the same as we are unsure on where we >> are supposed to keep the required files and how the final directory >> structure would like. We have an idea and we'll update the current PR >> accordingly. >> >> Please do guide us on this. >> >> >> Regards, >> >> Amogh Tiwari >> >> On Wed, Dec 18, 2019 at 4:42 AM Luke Cwik <[email protected]> wrote: >> >>> Sorry for the long delay (was on vacation). >>> >>> Using org.apache.hadoop isn't part of the Apache Beam Core module but is >>> a dependency for those who depend on the Apache Beam Hadoop module. So I >>> don't think swapping the com.facebook.presto.hadoop version for the >>> org.apache.hadoop version will address Ismael's concern about including >>> hadoop dependencies as part of the core. >>> >>> I looked at shevek/lzo-java[1] and I think its our best choice since it >>> is: >>> * pure Java >>> * GPLv3 (would require marking the dependency optional and telling users >>> to add it explicitly which we have done in the past as well) >>> * small (<100kb) >>> * small dependency tree (commons-logging & findbugs annotations if we >>> only depend on lzo-core) >>> * performance (github page claims 500mb/s compression and 800mb/s >>> decompression) >>> >>> Alternatively we can make the LZO compression an extension module (with >>> the facebook dependency) and use a registrar to have it loaded dynamically. >>> >>> 1: https://github.com/shevek/lzo-java >>> >>> On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari <[email protected]> wrote: >>> >>>> While studying the code, we found that the airlift/ aircompressor >>>> library only requires some classes which are also present in apache hadoop >>>> common package. Therefore, we are now thinking that if we make changes in >>>> the airlift/ aircompressor package, remove the >>>> com.facebook.presto.hadoop and use the existing org.apache.hadoop >>>> <https://mvnrepository.com/artifact/org.apache.hadoop> package which >>>> is already included in beam. This will solve both #2 and #3 as the >>>> transitive dependency will be removed and the size will also be reduced by >>>> almost ~20mbs. >>>> >>>> But if we use this approach, we will have to manually change the util >>>> whenever any changes are made to the airlift library. >>>> >>>> On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <[email protected]> wrote: >>>> >>>>> Going with the Registrar/ServiceLoader route would allow for >>>>> alternative providers for the same compression algorithms so if they don't >>>>> like one they can always contribute a different one. >>>>> >>>>> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <[email protected]> wrote: >>>>> >>>>>> (1) seems not to be the issue because it is Apache licensed. >>>>>> (2) and (3) are the big issues, because it requires a provided huge >>>>>> uber jar that essentially leaks Hadoop classes into core SDK [1] so it is >>>>>> definitely concerning. >>>>>> >>>>>> We discussed at some point during the PR that added ZStandard support >>>>>> about creating some sort of Registrar for compression algorithms [2] but >>>>>> we >>>>>> decided to not go ahead because we could achieve that for the zstd case >>>>>> via >>>>>> the optional dependencies of commons-compress. Maybe it is time to >>>>>> reconsider if such mechanism is worth. For example for users that may not >>>>>> care about having the hadoop leakage to be able to use LZO. >>>>>> >>>>>> Refs. >>>>>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16 >>>>>> [2] https://issues.apache.org/jira/browse/BEAM-6422 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Is there a way to wrap this up as an optional dependency with >>>>>>> multiple >>>>>>> possible providers, if there's no good library satisfying all of the >>>>>>> conditions (in particular (1))? >>>>>>> >>>>>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <[email protected]> wrote: >>>>>>> > >>>>>>> > I was hoping that someone in the community would provide some >>>>>>> alternatives since there are quite a few implementations. >>>>>>> > >>>>>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <[email protected]> >>>>>>> wrote: >>>>>>> >> >>>>>>> >> Hi Luke, >>>>>>> >> >>>>>>> >> I agree with your thoughts and observations. But, >>>>>>> airlift:aircompressor is the only implementation of LZO in pure java. >>>>>>> That >>>>>>> straight away solves #5. >>>>>>> >> The other implementations that I found either have licensing >>>>>>> issues (since LZO natively uses GNU GPL licence) or are implemented >>>>>>> using >>>>>>> .c, .h and jni (which again make them dependent on the OS). Please refer >>>>>>> these: twitter/hadoop-lzo and shevek/lzo-java. >>>>>>> >> These were the main reasons why we based this on >>>>>>> airlift:aircompressor. >>>>>>> >> >>>>>>> >> Thanks and Regards, >>>>>>> >> Amogh >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <[email protected]> >>>>>>> wrote: >>>>>>> >>> >>>>>>> >>> I took a look. My biggest concern is finding a good LZO >>>>>>> implementation. Looking for one that preferably has: >>>>>>> >>> 1) Apache license >>>>>>> >>> 2) Has zero transitive dependencies >>>>>>> >>> 3) Is small >>>>>>> >>> 4) Is performant >>>>>>> >>> 5) Is native java or supports execution on the three main OSs >>>>>>> (Windows, Linux, Mac) >>>>>>> >>> >>>>>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16 >>>>>>> which doesn't meet item #2 and its transitive dependency fails #3. >>>>>>> >>> >>>>>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <[email protected]> >>>>>>> wrote: >>>>>>> >>>> >>>>>>> >>>> Hi, >>>>>>> >>>> I have filed a PR for an extension that will enable Apache Beam >>>>>>> to work with LZO/LZOP compression. Please refer. >>>>>>> >>>> I would love it if someone can take this up and review it. >>>>>>> >>>> Please feel free to share your thoughts/suggestions. >>>>>>> >>>> Regards, >>>>>>> >>>> Amogh >>>>>>> >>>>>>
