[jira] [Commented] (HADOOP-6837) Support for LZMA compression

2013-06-19 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13687660#comment-13687660
 ] 

Joydeep Sen Sarma commented on HADOOP-6837:
---

yes - the fb-hadoop tree has a working implementation. most of the original 
code came from Baidu.

we tried to convert many petabytes to lzma. (switching from gzip compressed 
rcfile to lzma compressed). aside from speed issues (writes are very slow in 
spite of trying our best to fiddle around with different lzma settings directly 
in code) - the problem is we got rare corruptions every once in a while. these 
didn't seem to have anything to do with hadoop code - but the lzma codec 
itself. certain blocks would be unreadable. we had to abandon the conversion 
project at that point.

my gut is that for small scale uses - the lzma stuff as implemented in 
fb-hadoop-20 works.

across petabytes of data - where every rcfile block (1MB) has multiple 
compressed streams (1 per column) - and we are literally opening and closing 
billions of compressed streams - there are latent bugs in lzma (that were well 
beyond our capability to debug - leave alone reproduce accurately).

we never had the same issues with gzip obviously (so the problem cannot be 
hadoop components like HDFS).

 Support for LZMA compression
 

 Key: HADOOP-6837
 URL: https://issues.apache.org/jira/browse/HADOOP-6837
 Project: Hadoop Common
  Issue Type: Improvement
  Components: io
Reporter: Nicholas Carlini
Assignee: Nicholas Carlini
 Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
 HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
 HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, 
 HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch


 Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
 generally achieves higher compression ratios than both gzip and bzip2.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HADOOP-6837) Support for LZMA compression

2013-06-18 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13687627#comment-13687627
 ] 

Harsh J commented on HADOOP-6837:
-

FB's hadoop-0.20 seems to have a working implementation of this, although I do 
not yet know how stable it is: 
https://github.com/facebook/hadoop-20/blob/master/src/core/org/apache/hadoop/io/compress/LzmaCodec.java
 (and others).



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-11-08 Thread Erik Forsberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929540#action_12929540
 ] 

Erik Forsberg commented on HADOOP-6837:
---

Any progress on getting the new patch based on liblzma ready?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-09-28 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915938#action_12915938
 ] 

Joydeep Sen Sarma commented on HADOOP-6837:
---

thanks to everyone on getting lzma into hadoop. it seems to be very promising.

i have tried applying the latest patch to both hadoop-0.20 (yahoo/facebook 
branch) and common-trunk. in both cases - when i try running TestCodec after 
compiling the native codec - i get a sigsegv:

[junit] Running org.apache.hadoop.io.compress.TestCodec
[junit] #
[junit] # An unexpected error has been detected by Java Runtime Environment:
[junit] #
[junit] #  SIGSEGV (0xb) at pc=0x2aaad5215659, pid=16028, tid=1076017472
[junit] #
[junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode 
linux-amd64)
[junit] # Problematic frame:
[junit] # C  [libhadoop.so.1.0.0+0x5659]  thisRead+0x49
[junit] #

separate from this - i had a question about tuning the compression level. in my 
testing on internal data using the lzma utility built from the SDK - i found a 
bunch of interesting options that provided a more suitable compromise between 
compression ratio/cpu (-a0 -mfhc4 -d24 -fbxxx) than the default. eyeing the 
'level' based normalization - it seems i won't be able to quite achieve the 
settings i want by specifying a level. so it seems that being able to configure 
these options separately would be very useful.
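
To illustrate why a single 'level' cannot carry the settings mentioned above, here is a minimal sketch that keeps each knob as its own field. The class name LzmaTuning and its fields are hypothetical and do not come from any of the attached patches. The -fb value is not given in the comment, so the sketch leaves it unset rather than guessing one.

```java
/** Hypothetical holder for the encoder knobs named above; not part of any patch. */
final class LzmaTuning {
  static final int UNSET = -1;

  int algorithm = UNSET;      // -a<N>: 0 selects the SDK's fast mode
  String matchFinder = null;  // -mf<name>: e.g. hc4 (hash chain, 4 bytes)
  int dictionaryLog = UNSET;  // -d<N>: dictionary size is 2^N bytes
  int fastBytes = UNSET;      // -fb<N>: number of fast bytes

  /** Dictionary size in bytes, or UNSET if -d was not given. */
  long dictionaryBytes() {
    return dictionaryLog == UNSET ? UNSET : 1L << dictionaryLog;
  }

  /** Parses a whitespace-separated option string such as "-a0 -mfhc4 -d24". */
  static LzmaTuning parse(String options) {
    LzmaTuning t = new LzmaTuning();
    for (String opt : options.trim().split("\\s+")) {
      if (opt.isEmpty()) {
        continue;
      } else if (opt.startsWith("-mf")) {
        t.matchFinder = opt.substring(3);
      } else if (opt.startsWith("-fb")) {
        t.fastBytes = Integer.parseInt(opt.substring(3));
      } else if (opt.startsWith("-a")) {
        t.algorithm = Integer.parseInt(opt.substring(2));
      } else if (opt.startsWith("-d")) {
        t.dictionaryLog = Integer.parseInt(opt.substring(2));
      } else {
        throw new IllegalArgumentException("unknown option: " + opt);
      }
    }
    return t;
  }
}
```

With this sketch, LzmaTuning.parse("-a0 -mfhc4 -d24").dictionaryBytes() is 16777216, i.e. a 16 MB dictionary. A codec could read fields like these from separate configuration entries instead of deriving all of them from one level.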




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Pelle Nilsson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897277#action_12897277
 ] 

Pelle Nilsson commented on HADOOP-6837:
---

Do I read these comments correctly that LZMA2/xz is not included in the current 
patch, and might not be included as part of this issue since the LZMA Java lib 
does not support it?




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897391#action_12897391
 ] 

Hong Tang commented on HADOOP-6837:
---

Nicholas has done some interesting work. But unfortunately I will -1 
marking it patch available. The current patch carries a modified version of the 
LZMA SDK. This is a huge maintenance overhead going forward when a much 
simpler solution clearly exists. We should explore the liblzma route first, as I 
mentioned in http://bit.ly/cDz2Pk.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Jakob Homan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897400#action_12897400
 ] 

Jakob Homan commented on HADOOP-6837:
-

It may be good to address the code style issue now, since this patch diverges 
significantly from our standard: 
http://wiki.apache.org/hadoop/CodeReviewChecklist  Eclipse can re-format 
everything into Hadoop's style pretty well, if that will save time.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897512#action_12897512
 ] 

Greg Roelofs commented on HADOOP-6837:
--

bq. The currently patch carries a modified version of LZMA SDK. This is a huge 
maintenance overhead going forward where a much simpler solution clearly exists.

If the modifications are rolled into the SDK, this issue goes away.  Nicholas, 
can you create a current diff of the src/contrib part relative to lzma912's 
original path structure (i.e., so it applies cleanly to a stock lzma912 
codebase)?  Then we can send it off to the 7Zip folks and see if they're 
willing to incorporate it.

(And only partly exists.  There's no Java in liblzma.  On the other hand, 
consensus around here seems to be that built-in Java support isn't necessary.)

bq. It may be good to address the code style issue now, since this patch 
diverges significantly from our standard

Only the src/contrib portion does, and that was intentional.  bzip2 is no 
longer actively developed, so an in-tree, heavily modified port is no big deal. 
 LZMA, however, is still a very active project, and if we ever wanted to 
upgrade to a newer release (e.g., for performance or correctness reasons), we 
do _not_ want a lot of whitespace noise hiding the real diffs.

But this issue also largely disappears if the substantive modifications are 
accepted upstream; then the formatting is fairly irrelevant, though still a 
pain for diffs and patches.  Either way, I don't think style rules are or 
should necessarily be applicable to contrib code (in the 
outside-the-core-codebase sense of contrib).




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-09 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896739#action_12896739
 ] 

Greg Roelofs commented on HADOOP-6837:
--

bq. FakeOutputStream isn't the one I'm talking about in package.html. That's 
for the OutputStream/ FakeInputStream. FakeOutputStream is just the one where I 
couldn't justify the maximum acting correctly (writing a max of 273 bytes 
extra) so I added the linked list in case anything goes wrong.

Argh, yes, of course...we discussed that at least twice already.  Sorry for 
spacing.

Did you ever instrument it to emit a warning if it did go above 273 extra bytes?




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-09 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896751#action_12896751
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

I did. The average number of overflow bytes is 24. I never saw it go above 120. 
A quick sed/dc script tells me the standard deviation is 18. So I'm fairly sure 
that I am correct in that it will never go above 273. Trying with setting the 
number of fast bytes to 273 gives average of 37 and standard deviation of 26.
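
As a worked check of the margin implied by these figures, the sketch below computes how many standard deviations separate the observed mean from the 273-byte limit. The class and method names are hypothetical. The result only describes the measured sample; it is not a proof that the limit can never be exceeded.

```java
/** Hypothetical helper: distance from the observed mean to a limit, in sigmas. */
final class OverflowMargin {
  static double sigmasBelowLimit(double limit, double mean, double stdDev) {
    return (limit - mean) / stdDev;
  }

  public static void main(String[] args) {
    // Default settings: mean 24, standard deviation 18.
    System.out.printf("default:        %.1f sigma%n", sigmasBelowLimit(273, 24, 18));
    // Number of fast bytes set to 273: mean 37, standard deviation 26.
    System.out.printf("273 fast bytes: %.1f sigma%n", sigmasBelowLimit(273, 37, 26));
  }
}
```

That is (273 - 24) / 18, about 13.8 standard deviations for the default case, and (273 - 37) / 26, about 9.1, with 273 fast bytes.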




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-06 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896213#action_12896213
 ] 

Greg Roelofs commented on HADOOP-6837:
--

20100806 version looks pretty good.  The first couple of issues are the main 
ones:

 * FakeOutputStream LinkedList of 1 KB buffers (BLOCKSIZE = 1024):  supposed to 
be 128 KB buffers (package.html)
 * directory structure = unholy mix of 7zip, Hadoop  (expected directory 
structure similar to LZMA SDK tarball contents, e.g.,  
lzma912/Java/SevenZip/Compression/LZMA/Decoder.java and lzma912/C/LzFind.c)
   ** OK to trim some of tarball's levels out (and to be different for Java, 
C), and top level need not be lzma912 (though it makes connection to download 
more obvious)--I know I suggested SevenZip, but I thought that appeared in 
both the Java and the C paths:  suggest lzma912 or lzma-9.12 instead
 * kDefaultDictionaryLogSize no longer changed to 16:  OK?
 * apparently bogus files:
   ** src/contrib/SevenZip/ivy/libraries.properties
   ** src/contrib/ec2/bin/hadoop-ec2-env.sh
 * LZMA SDK -> LZMA SDK 9.12 in all boilerplate
 * CRC:  prefer to reuse existing (e.g., PureJavaCrc32); should be compatible
 * LzmaNativeInputStream.java:
   ** circularwould
   ** read():  fast busy-loop not thread-friendly...and not necessary:  
read(b[]) (InputStream) blocks until at least 1 byte available--zero returned 
only if b[] has length zero, which is not true of oneByte[]
   ** read():  t unnecessary; just do:  return (int)oneByte[0] & 0xFF;
 or even:  return (ret == -1)? -1 : (int)oneByte[0] & 0xFF;
   ** 113
 * LzmaNativeOutputStream.java:
   ** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf as above
   ** 116
 * LzmaOutputStream.java:
   ** 117
 * Makefile.am
   ** All these are setup -> All these are set up
   ** are also done -> is also done
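
The read() point in the list above can be sketched as follows. This is a minimal illustration, not the code from the patch: SingleByteReadSketch is a hypothetical class, and its bulk read() simply forwards to a wrapped stream where the real class would decompress. It relies on the InputStream contract that a bulk read with a non-zero length blocks until at least one byte is available or end of stream is reached.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a codec input stream; only read() is of interest. */
class SingleByteReadSketch extends InputStream {
  private final InputStream in;
  private final byte[] oneByte = new byte[1];

  SingleByteReadSketch(InputStream in) {
    this.in = in;
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    // Placeholder for the real decompressing read.
    return in.read(b, off, len);
  }

  @Override
  public int read() throws IOException {
    // No busy-loop: the bulk read blocks until a byte arrives or EOF.
    int ret = read(oneByte, 0, 1);
    // Mask so that bytes 0x80..0xFF come back as 128..255, not negative.
    return (ret == -1) ? -1 : (oneByte[0] & 0xFF);
  }
}
```

Without the mask, a byte such as 0xFF would be returned as -1 and be indistinguishable from end of stream.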

(Per previous feedback, please change all 1<<N to the Java equivalent of 
static const int kSomeBufferSize = M * 1024:  easier to read, easier to change 
later.)

Btw, it's always best to follow the existing style consistently than to use 
your own for your changes (with the possible exception of the boilerplate 
comment).  Perhaps Emacs hides the issue, but with 8-space tabs, your changes 
to the contrib LZMA files are a complete mismatch to their style.

Be sure to run ant javadoc (and fix any new issues) before the next patch, 
and give ant test a shot, too (over the weekend if you happen to see this--it 
takes several hours to run).  I'll work with you to get ant test-patch going.





[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-06 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896214#action_12896214
 ] 

Greg Roelofs commented on HADOOP-6837:
--

On second thought, _if_ you put the version into the src/contrib path as I 
suggested, there's no need to add it to the boilerplate text, too.  That will 
make future forward-ports simpler (i.e., they can use the same boilerplate 
text).




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-02 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894773#action_12894773
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

Responding to the major comments -- will upload a patch that fixes these and 
the smaller comments soon.

FakeInputStream LinkedList:
This LinkedList can get fairly long, depending on how write is called. Worst 
case it can have upwards of 12 million elements, which is far beyond 
acceptable. This is the case if the write(single_byte) is called over and over. 
Each call will add a new link. Looking back at this, linked list probably 
wasn't the best way to go.

There are two (obvious) ways that write() could have worked. One is using 
linked lists as I did. The other way would be to create a byte array that can 
hold forceWriteLen bytes and just copy into it; however this can be as large as 
12MB. There are then two other ways to make this work. The first is just 
allocating the 12MB right up front. The other way is to start it with maybe 
just 64k, and make it grow (by powers of two) until it reaches 12MB, however 
this would end up arraycopying a little under 12MB in total more than the other 
solution. I will implement one of these for the patch.
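
A minimal sketch of the second option, assuming a 64 KB starting size and a 12 MB cap as described above. GrowableBuffer and its members are hypothetical names, and this is not the code that ended up in any patch.

```java
import java.util.Arrays;

/** Hypothetical byte buffer that doubles from 64 KB up to a 12 MB cap. */
final class GrowableBuffer {
  static final int INITIAL_CAPACITY = 64 * 1024;
  static final int MAX_CAPACITY = 12 * 1024 * 1024;

  private byte[] buf = new byte[INITIAL_CAPACITY];
  private int count = 0;

  /** Grows the backing array by doubling until 'extra' more bytes fit. */
  private void ensureRoomFor(int extra) {
    if (extra > MAX_CAPACITY - count) {
      throw new IllegalStateException("buffer would exceed " + MAX_CAPACITY + " bytes");
    }
    int needed = count + extra;
    int newCap = buf.length;
    while (newCap < needed) {
      newCap = Math.min(newCap * 2, MAX_CAPACITY);
    }
    if (newCap != buf.length) {
      buf = Arrays.copyOf(buf, newCap);  // copies the old contents across
    }
  }

  void write(byte[] src, int off, int len) {
    ensureRoomFor(len);
    System.arraycopy(src, off, buf, count, len);
    count += len;
  }

  /** Single-byte writes cost one array store, not one list node each. */
  void write(int b) {
    ensureRoomFor(1);
    buf[count++] = (byte) b;
  }

  int size() { return count; }

  int capacity() { return buf.length; }
}
```

The trade-off is the one described above: allocating the full 12 MB up front avoids the copies made on each growth step, while doubling keeps small streams cheap.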


FakeOutputStream LinkedList:
This linked list has a more reasonable use. Its purpose is to hold extra bytes 
just in case the input stream gives too many. I am fairly confident that at 
most 272 bytes (maximum number of fast bytes - 1) can be written to it. The 
reason I used a linked list, however, is that I couldn't formally prove this 
after going through code. I wanted to be safe and just in case their code 
doesn't behave as it should, everything will work on the OutputStream end.


Code(..., len)
I think I remember figuring out that Code(...) will return at least (but 
possibly more than) len bytes with the one exception that when the end of the 
stream is reached it will only read up to the end of the stream. I will modify 
the decompressor to no longer assume this and use the actual number of bytes 
read instead.


Fixed the inStream.read() bug (the fix will be in the patch I upload). Added a 
while loop to read until EOF is reached so the assumptions are true.
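
A loop of this kind might look like the sketch below. ReadFully and readUntilFullOrEof are hypothetical names, not taken from the patch. The point is that InputStream.read(byte[], int, int) may return fewer bytes than requested, so the return value has to be accumulated.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: keep reading until the request is satisfied or EOF. */
final class ReadFully {
  /** Returns the number of bytes actually stored, which is less than len only at EOF. */
  static int readUntilFullOrEof(InputStream in, byte[] b, int off, int len)
      throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(b, off + total, len - total);
      if (n == -1) {
        break;  // end of stream: report what was read so far
      }
      total += n;
    }
    return total;
  }
}
```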


Tail call recursive methods -> while loop. Java should add tail-call 
optimizations when methods only call themselves recursively (which would 
require no changes to the bytecode).
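
For reference, the shape of that change is sketched below on a simplified writer. ChunkedWriter is a hypothetical class, not the LzmaNativeOutputStream code. The recursive form uses one stack frame per chunk, since HotSpot does not eliminate tail calls, while the loop uses constant stack.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical writer that forwards data in fixed-size chunks. */
final class ChunkedWriter {
  private final OutputStream out;
  private final int chunkSize;

  ChunkedWriter(OutputStream out, int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    this.out = out;
    this.chunkSize = chunkSize;
  }

  /** Tail-recursive form: write one chunk, then recurse on the remainder. */
  void writeRecursive(byte[] b, int off, int len) throws IOException {
    if (len <= 0) {
      return;
    }
    int n = Math.min(len, chunkSize);
    out.write(b, off, n);
    writeRecursive(b, off + n, len - n);
  }

  /** Equivalent loop: the recursive call becomes an update of off and len. */
  void writeIterative(byte[] b, int off, int len) throws IOException {
    while (len > 0) {
      int n = Math.min(len, chunkSize);
      out.write(b, off, n);
      off += n;
      len -= n;
    }
  }
}
```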


Fixed memory leaks.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-07-30 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894246#action_12894246
 ] 

Greg Roelofs commented on HADOOP-6837:
--

First, apologies for having steered Nicholas wrong on the liblzma issue.  As
Hong noted, it provides a much saner (that is, zlib-like) API for this sort
of thing, but I mistakenly thought it shared the GPL license of (parts of)
xz, so we ignored it and he worked on the LZMA SDK code instead.  (The latter
did include a Java port, however; liblzma does not.)

Overall, the 20100722 patch looks pretty decent (given the starting material),
but it does include some less-than-beautiful workarounds to cope with the
impedance mismatch between push- and pull-style I/O models.  In light of the
fact that liblzma is, in fact, public-domain software (every file under
xz-4.999.9beta-143-g3e49/src/liblzma is either explicitly in the public
domain or has been automatically generated by such a file), I'm going to ask
that Nicholas redo the native-code version to use liblzma rather than the
SDK.  (Unfortunately, it looks like the transformation from C SDK to liblzma
was a significant amount of work, so it doesn't appear that a trivial liblzma-
ification of the Java SDK code is likely.  If Nicholas concurs with that
assessment, we can instead file a separate JIRA to port the liblzma C code
to Java.)  Note that liblzma includes an LZMA2 codec, so Scott Carey's
splittable-codec suggestion is within reach, too.

OK, enough preamble.  There were a number of relatively minor style issues,
which I'll simply list below, but the five main concerns were:

 - FakeInputStream.java, FakeOutputStream.java:  the linked lists of byte
   arrays are tough to swallow, even given the push/pull problem, even given
   our previous discussions on the matter.  It would be good to know what the
   stats are on these things in typical cases--how frequently does overflow
   occur in LzmaInputStream, for example, and how many buffers are used?

 - Is the Code(..., len) call in LzmaInputStream guaranteed to produce len
   bytes if it returns true?  The calling read() function assumes this, but
   it's not at all obvious to me; the parameter is outSize in Code(), and I
   don't see that it's decremented or even really used at all (other than
   being stored in oldOutSize), unless it's buried inside macros defined
   elsewhere.

The next two (or perhaps three) are no longer directly relevant, but they're
general things to watch out for:

 - The return value from inStream.read() in LzmaNativeInputStream.java is
   ignored even though there's no guarantee the call will return the requested
   number of bytes.  A comment (never have to call ... again) reiterates
   this error.

 - There's no need for recursion in LzmaNativeOutputStream.java's write()
   method; iterative solutions tend to be far cleaner, I think.  (Probably
   ditto for LzmaNativeInputStream.java's read() method.)

 - LzmaCompressor.c has a pair of memleaks (state->outStream, state->inStream).

Here are the minor readability/maintainability/cosmetic/style issues:

 * add LZMA SDK version (apparently 9.12) and perhaps its release date to the
   boilerplate
 * tabs/formatting of LZMA SDK code (CRC.java, BinTree, etc.):  I _think_
   tabs are frowned upon in Hadoop, though I might be wrong; at any rate,
   they seem to be rarely used
   ** for easy Hadoop-style formatting, indent -i2 -br -l80 is a start (though
 it's sometimes confused by Java/C++ constructs)
 * reuse existing CRC implementation(s)?  (JDK has one, Hadoop has another)
 * prefer lowercase lzma for subdirs
 * use uppercase LZMA when referring to codec/algorithm (e.g., comments)
 * add README mentioning anything special/weird/etc. (e.g., weird hashtable
   issue); summary of changes made for Hadoop; potential Java/C diffs; binary
   compatibility between various output formats (other than trivial headers/
   footers); LZMA2 (splittable, not yet implemented); liblzma (much cleaner,
   more zlib-like implementation, still PD); etc.
 * ant javadoc run yet?  (apparently not recently)
 * line lengths, particularly comments (LzmaInputStream.java, etc.):  should
   be no more than 80 columns in general (Hadoop guidelines)
 * avoid generic variable names for globals and class members; use existing
   conventions where possible (e.g., look at gzip/zlib and bzip2 code)
 * LzmaCodec.java:
   ** uppercase LZMA when referring to codec/algorithm in general
   ** funcionality x 4
   ** throws ... { continuation line:  don't further indent
 * LZ/InWindow.java
   ** leftover debug code at end
 * RangeCoder/Encoder.java
   ** spurious blank line (else just boilerplate)
 * FakeOutputStream.java:
   ** stuffeed
   ** ammount
   ** isOverflow() -> didOverflow()
 * LzmaInputStream.java:
   ** [uses FakeOutputStream]
   ** bufferd
   ** we 've
   ** index -> overflowIndex (or similar):  too generic
   ** 

[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-07-26 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892425#action_12892425
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

... that was supposed to go on HADOOP-6349, not here. Ignore that.

 Support for LZMA compression
 

 Key: HADOOP-6837
 URL: https://issues.apache.org/jira/browse/HADOOP-6837
 Project: Hadoop Common
  Issue Type: Improvement
  Components: io
Reporter: Nicholas Carlini
Assignee: Nicholas Carlini
 Attachments: hadoop-6349-2.patch, HADOOP-6837-lzma-1-20100722.patch, 
 HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch


 Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
 generally achieves higher compression ratios than both gzip and bzip2.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-07-19 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890043#action_12890043
 ] 

Hong Tang commented on HADOOP-6837:
---

@nicolas, per our offline conversation last week, have you looked into whether 
the licensing of liblzma is suitable for inclusion in Hadoop? Liblzma seems 
better in the sense that its API resembles closely the APIs of other 
compression libraries like bzip or zlib and should shrink the amount of coding 
work needed to support C (and Java over JNI).




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-07-19 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890051#action_12890051
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

I spoke with Greg about it just now and he said it would probably be better for 
me to work on FastLZ first, and come back to doing that later.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-29 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883630#action_12883630
 ] 

Scott Carey commented on HADOOP-6837:
-

What happens if you compress a tarball of those files instead?

Here are my results, on a directory with 4.1GB of ~64MB files.  The content is 
mixed binary/text (key/value data, binary keys, mixed binary/text values).

This is on CentOS 5.5, with the 'xz' and 'bzip2' packages installed via yum.

Compression / decompression speed.  Disk is capable of 200MB/sec read/write, 
16GB RAM, Nehalem based processor (Xeon E5620, 2.4GHz).
Tests confirmed to be CPU bound with no iowait.  Measurements are in MB/sec 
of uncompressed data.
Source tarball, 4130 MB  (100%)  
||type||compressed size||compressed size as percent||time to compress||compression rate||time to decompress||decompression rate||
|gzip -1|1430 MB|34.6%|105 s|39.3 MB/sec|42 s|98.3 MB/sec|
|gzip -6|1240 MB|30.0%|251 s|16.5 MB/sec|41.5 s|99.5 MB/sec|
|bzip2 -2|1003 MB|24.3%|656 s|6.3 MB/sec|168 s|24.6 MB/sec|
|bzip2 -6|942 MB|22.8%|725 s|5.7 MB/sec|176 s|23.5 MB/sec|
|bzip2 -9|926 MB|22.4%|763 s|5.4 MB/sec|181 s|22.8 MB/sec|
|xz -2|993 MB|24.0%|429 s|9.63 MB/sec|95 s|43.5 MB/sec|
|xz -6|794 MB|19.2%|2861 s|1.44 MB/sec|83 s|49.7 MB/sec|
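
The percentages and rates in the table follow directly from the 4130 MB source size and the wall-clock times. A small Python sketch of that arithmetic, using a few rows copied from the table (the function name `summarize` is made up for illustration):

```python
SOURCE_MB = 4130  # uncompressed tarball size from the table

# (compressed size in MB, seconds to compress, seconds to decompress)
rows = {
    "gzip -1": (1430, 105, 42),
    "bzip2 -2": (1003, 656, 168),
    "xz -2": (993, 429, 95),
    "xz -6": (794, 2861, 83),
}


def summarize(compressed_mb, compress_s, decompress_s):
    """Return (size as percent, compression MB/sec, decompression MB/sec)."""
    percent = 100.0 * compressed_mb / SOURCE_MB
    return percent, SOURCE_MB / compress_s, SOURCE_MB / decompress_s


for name, row in rows.items():
    pct, c_rate, d_rate = summarize(*row)
    print(f"{name:9s} {pct:5.1f}%  {c_rate:6.2f} MB/s  {d_rate:6.1f} MB/s")
```

Both rates are relative to the uncompressed size, which is why they can be compared across codecs with different ratios.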

Note that on today's newest processors, gzip decompresses at gigabit ethernet 
speeds.  xz is half that, and bzip2 about half that again.  gzip and xz 
decompress faster at higher compression ratios, while bzip2 decompresses slower 
at higher ratios.  All compress slower the higher the ratio, but bzip2 only slows 
down by ~20% or so from the fast to slow settings, while gzip and xz slow down 
by a factor of 10+ (I did not do -9 tests here for those, they are very slow).

IMO, since xz -2 is roughly 1.5x as fast at compression and 1.8x as fast at 
decompression as bzip2 -2, and similar in compression ratio, it leaves little 
room for bzip2's use.
At higher compression levels, xz is very slow to compress, but achieves 
compression ratios significantly better than anything else and still 
decompresses very fast, so it's great for archival storage.

For faster compression, gzip -1 or lzo and other compression types without an 
entropy coder are the only options.

The link I provided above has several cases where xz is 3 or more times faster 
than bzip2 at decompression, but my data doesn't behave that way.


Raw Data:

$ time cat packed.tar | gzip -c1 > packed.gz1
real    1m44.938s
user    1m42.200s
sys     0m5.300s

$ time cat packed.tar | gzip -c6 > packed.gz6
real    4m11.051s
user    4m8.438s
sys     0m5.317s

$ time cat packed.tar | bzip2 -2 > packed.bz2-2
real    10m55.795s
user    10m52.989s
sys     0m5.030s

$ time cat packed.tar | bzip2 -6 > packed.bz2-6
real    12m4.847s
user    12m2.049s
sys     0m5.345s

$ time cat packed.tar | bzip2 -9 > packed.bz2-9
real    12m43.063s
user    12m40.353s
sys     0m4.797s


$ time cat packed.tar | xz -zv -2 - > packed.xz
  100.0 % 991.1 MiB / 4,125.0 MiB = 0.240   9.6 MiB/s   7:09
real    7m9.369s
user    7m6.985s
sys     0m7.140s

$ time cat packed.tar | xz -zv -6 - > packed.xz6
  100.0 % 792.6 MiB / 4,125.0 MiB = 0.192   1.4 MiB/s  47:41
real    47m41.033s
user    47m37.794s
sys     0m8.371s

--
Tests of decompression: 

$ time cat packed.gz1 | gunzip > /dev/null
real    0m42.081s
user    0m41.814s
sys     0m1.361s

$ time cat packed.gz6 | gunzip > /dev/null
real    0m41.512s
user    0m41.021s
sys     0m1.086s


$ time cat packed.bz2-2 | bunzip2 > /dev/null
real    2m48.528s
user    2m48.014s
sys     0m1.455s

$ time cat packed.bz2-6 | bunzip2 > /dev/null
real    2m56.511s
user    2m55.999s
sys     0m1.302s

$ time cat packed.bz2-9 | bunzip2 > /dev/null
real    3m1.064s
user    3m0.559s
sys     0m1.409s

$ time cat packed.xz | xz -dc > /dev/null
real    1m35.239s
user    1m34.873s
sys     0m1.301s

$ time cat packed.xz6 | xz -dc > /dev/null
real    1m23.219s
user    1m22.771s
sys     0m1.126s
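
For anyone who wants to repeat this kind of comparison without the command-line tools, here is a rough Python sketch using only the standard library. It is an assumption-laden stand-in, not the procedure used above: the data is synthetic, `zlib` produces raw zlib streams rather than gzip files, everything runs in memory, and absolute numbers will not match the shell timings.

```python
import bz2
import lzma
import time
import zlib


def benchmark(name, compress, decompress, data):
    """Time one round trip; rates are MB/sec of uncompressed data."""
    start = time.perf_counter()
    blob = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(blob)
    decompress_s = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name}: round trip failed")
    mb = len(data) / 1e6
    return {
        "name": name,
        "percent": 100.0 * len(blob) / len(data),
        "compress_rate": mb / max(compress_s, 1e-9),
        "decompress_rate": mb / max(decompress_s, 1e-9),
    }


# Synthetic key/value records, roughly 1 MB (hypothetical sample data).
regions = [b"us", b"eu", b"ap"]
data = b"".join(
    b"user%06d\tscore=%d\tregion=%s\n" % (i, i * 7919 % 1000, regions[i % 3])
    for i in range(40000)
)

codecs = [
    ("zlib -1", lambda d: zlib.compress(d, 1), zlib.decompress),
    ("zlib -6", lambda d: zlib.compress(d, 6), zlib.decompress),
    ("bzip2 -9", lambda d: bz2.compress(d, 9), bz2.decompress),
    ("xz -2", lambda d: lzma.compress(d, preset=2), lzma.decompress),
    ("xz -6", lambda d: lzma.compress(d, preset=6), lzma.decompress),
]

results = [benchmark(name, c, d, data) for name, c, d in codecs]
for r in results:
    print(f"{r['name']:9s} {r['percent']:5.1f}%  "
          f"{r['compress_rate']:8.1f} MB/s  {r['decompress_rate']:8.1f} MB/s")
```

A single timed run on a small buffer is noisy, so treat the output as a sanity check on relative ordering rather than a measurement.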






[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-28 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883265#action_12883265
 ] 

Greg Roelofs commented on HADOOP-6837:
--

Scott Carey wrote:

bq. lzma always decompresses 2 to 7 times as fast as bzip2 (only ~ half the 
decompression speed of gzip).

I didn't see that in my tests.  My measurements (last column) are in terms of 
compressed MB/sec, i.e., scaled by the compression ratio, but the ratios are 
close enough that that isn't a big factor:

{noformat}
bzip2-1: text = 78.9% (1.1),   1.464 (0.028) ucMB/sec,   1.189 (0.037) cMB/sec
         bin  = 50.1% (3.4),   1.395 (0.021) ucMB/sec,   2.170 (0.036) cMB/sec
bzip2-9: text = 80.5% (1.0),   1.415 (0.028) ucMB/sec,   1.135 (0.037) cMB/sec
         bin  = 51.6% (3.6),   1.340 (0.020) ucMB/sec,   1.878 (0.032) cMB/sec

xz-1:    text = 79.6% (1.0),   2.705 (0.097) ucMB/sec,   1.457 (0.049) cMB/sec
         bin  = 53.3% (3.5),   1.820 (0.031) ucMB/sec,   2.93  (0.20)  cMB/sec
xz-9:    text = 82.4% (0.8),   0.240 (0.011) ucMB/sec,   1.433 (0.051) cMB/sec
         bin  = 57.2% (3.6),   0.351 (0.010) ucMB/sec,   2.73  (0.17)  cMB/sec
{noformat}

So xz/LZMA is definitely faster to decompress, but not immensely so.  (This was 
all C code.  The text and bin measurements are averages across roughly 350 
files of each type, various sizes.  Not a perfect corpus, but it should be 
varied enough to draw some reasonable conclusions.  On the other hand, the file 
sizes are definitely much smaller than is typical in Hadoop jobs.)
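
To put the last column on the same footing as rates quoted per uncompressed byte, a compressed-MB/sec figure can be divided by the compressed fraction. The sketch below assumes the percentages in the table are space savings (so text at 78.9% shrinks to 21.1% of its original size); the comment does not state this explicitly, and the function name is made up for illustration.

```python
def to_uncompressed_rate(compressed_mb_per_sec, space_savings):
    """Convert a rate over compressed bytes to a rate over uncompressed bytes.

    space_savings is the fraction of the input removed by compression.
    ASSUMPTION: this is what the percentages in the table mean.
    """
    compressed_fraction = 1.0 - space_savings
    return compressed_mb_per_sec / compressed_fraction


# Text rows from the table: (cMB/sec, assumed savings)
bzip2_1 = to_uncompressed_rate(1.189, 0.789)
xz_1 = to_uncompressed_rate(1.457, 0.796)

print(f"bzip2-1 text: {bzip2_1:.2f} ucMB/sec")
print(f"xz-1 text:    {xz_1:.2f} ucMB/sec")
print(f"xz-1 / bzip2-1 = {xz_1 / bzip2_1:.2f}x")
```

Under that assumption the ordering between the codecs is unchanged by the conversion, since the ratios are close to one another.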

Btw, I didn't see Nicholas mention it, but all of the LZMA variants he tested 
appear to be stream-compatible--that is, any of the tools can decompress any of 
the others' streams, possibly modulo some header-parsing.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-24 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882214#action_12882214
 ] 

Scott Carey commented on HADOOP-6837:
-

Isn't there a new variant of LZMA (file extension xz) that uses LZMA2 and is 
block based (and therefore splittable)?  We should definitely make sure that is 
the variant we want to support.
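
As a rough illustration of why independently decodable units help with splitting, the sketch below uses Python's standard `lzma` module to compress chunks as separate .xz streams and concatenate them. This is only an analogy: it uses whole streams rather than LZMA2 blocks inside one stream, and the chunk size and helper name are assumptions, not how any Hadoop codec works.

```python
import lzma


def compress_in_chunks(data, chunk_size):
    """Compress each chunk as its own .xz stream; return the list of streams."""
    return [
        lzma.compress(data[i:i + chunk_size], format=lzma.FORMAT_XZ)
        for i in range(0, len(data), chunk_size)
    ]


# Synthetic fixed-width records (hypothetical sample data).
data = b"".join(b"record %08d\n" % i for i in range(50000))
streams = compress_in_chunks(data, chunk_size=256 * 1024)

# A reader that knows the stream boundaries can decode any piece in isolation.
second_piece = lzma.decompress(streams[1])

# The concatenation of the streams still decodes to the whole input.
whole = lzma.decompress(b"".join(streams))

print(len(streams), "streams,", len(b"".join(streams)), "compressed bytes")
```

The cost is a slightly worse ratio, since each chunk starts with an empty dictionary; a real splittable format also needs a way to locate the boundaries.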

LZMA is slower than gzip, but compresses better than both bzip2 and gzip.  It 
is also optimized for fast decompression -- it decompresses significantly 
faster than bzip2 (but not as fast as gzip).

This link is useful for understanding the performance / compression ratio 
differences across the various compression levels provided for each:

http://tukaani.org/lzma/benchmarks.html


LZO, FastLZ, LZF, and the like are all faster than the above three but compress 
at a lower ratio.  With LZMA support (hopefully .xz files, not the older 7zip) 
there is little reason to use bzip2 anymore -- lzma level 2 compresses as fast 
as bzip2 level 1, but has a compression ratio as high as bzip2 level 9.  lzma 
always decompresses 2 to 7 times as fast as bzip2 (only ~ half the 
decompression speed of gzip). 

It is the ideal archival storage format.   




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-24 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882250#action_12882250
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

The Java code from the SDK hasn't been updated since version 4.61 (which is as 
of 23 November, 2008), so support for LZMA2 would need to rely on C code, or be 
ported to Java. 

The compression ratios of LZMA and LZMA2 are nearly identical (+/- .01% from 
the tests I did). It does look like LZMA2 is block based and is splittable, so 
that would be a major plus for it.

On the differences between LZMA and LZMA2:

bq. LZMA2 is an extension on top of the original LZMA. LZMA2 uses LZMA 
internally, but adds support for flushing the encoder, uncompressed chunks, 
eases stateful decoder implementations, and improves support for multithreading.

http://tukaani.org/xz/xz-file-format.txt

I did have to add support for flushing the encoder to the Java code (flushing 
the encoder still produces valid lzma-compressed output).
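
One quick way to see how close the two containers are is to compress the same buffer both ways. The sketch below uses Python's standard `lzma` module, where `FORMAT_ALONE` is the legacy .lzma container (LZMA1) and `FORMAT_XZ` is the .xz container (LZMA2). The sample data is synthetic, so the size difference on real data may differ from what it prints.

```python
import lzma

# Synthetic CSV-like rows (hypothetical sample data).
data = b"".join(b"row %07d,alpha,beta,gamma\n" % i for i in range(30000))

legacy = lzma.compress(data, format=lzma.FORMAT_ALONE, preset=6)  # .lzma, LZMA1
xz = lzma.compress(data, format=lzma.FORMAT_XZ, preset=6)         # .xz, LZMA2

# decompress() defaults to FORMAT_AUTO, which recognizes both containers.
restored_legacy = lzma.decompress(legacy)
restored_xz = lzma.decompress(xz)

diff_pct = 100.0 * (len(xz) - len(legacy)) / len(data)
print(f".lzma: {len(legacy)} bytes, .xz: {len(xz)} bytes")
print(f"difference: {diff_pct:+.3f}% of the uncompressed size")
```

Most of the difference comes from container overhead (headers, index, integrity check) rather than from the LZMA payload itself.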




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881915#action_12881915
 ] 

Allen Wittenauer commented on HADOOP-6837:
--

The 7z SDK license is Public Domain and 7z LZMA is LGPL.  Is that compatible 
with the APL?  




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881928#action_12881928
 ] 

Greg Roelofs commented on HADOOP-6837:
--

7-Zip is LGPL; the LZMA SDK is not:

License

LZMA SDK is placed in the public domain.

Given that both packages are hosted at the same site, with links to each other 
on the left bar, I think we can safely assume they know the difference between 
the two and have made a conscious decision to release them accordingly.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881929#action_12881929
 ] 

Eli Collins commented on HADOOP-6837:
-

Hey Nicholas,

Cool stuff.  Unfortunately LGPL is incompatible with APL so we couldn't check 
this in. See more at http://www.apache.org/legal/resolved.html

bq. The LGPL is ineligible primarily due to the restrictions it places on 
larger works, violating the third license criterion. Therefore, LGPL-licensed 
works must not be included in Apache products.

Do you need to use this particular codec or are you just looking for something 
better than gzip/bzip2? If the latter, HADOOP-6349 (support for FastLZ) would be 
a great place to direct your efforts: it's got a compatible license and, like 
LZMA, is significantly faster than gzip/bzip2 (and faster than the open source 
version of lzo). 

Thanks,
Eli





[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881931#action_12881931
 ] 

Eli Collins commented on HADOOP-6837:
-

If LZMA is public domain then it should be safe to include. It would be good 
to have clarification from the author.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881934#action_12881934
 ] 

Greg Roelofs commented on HADOOP-6837:
--

LZMA is not faster than gzip/bzip2; it compresses better.  FastLZ (next item on 
Nicholas's plate) is faster than LZO but compresses more poorly than everything 
else (except maybe LZW).  They're both useful, but they address different parts 
of the problem domain.




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Greg Roelofs (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881932#action_12881932
 ] 

Greg Roelofs commented on HADOOP-6837:
--

See the last line of their FAQ. ;-)




[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881935#action_12881935
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

Per the FAQ:

You can also read about the LZMA SDK, which is available under a more liberal 
license.

http://www.7-zip.org/faq.html
