$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: ok

real 93m51.202s
user 92m31.600s
sys 0m35.188s

$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: done

real 129m39.665s
user 108m15.368s
sys 7m18.684s

$ _IFL="enwiki-latest-pages-articles.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-latest-pages-articles.xml.bz2: ok

real 143m31.399s
user 128m4.964s
sys 1m9.108s

$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-latest-pages-articles.xml.bz2: done

real 147m59.737s
user 124m31.476s
sys 8m1.516s
$

On 10/13/20, Albretch Mueller <lbrt...@gmail.com> wrote:
> As part of my corpora research work I have to work with such large
> text files. Wikipedia dumps are bzip2 so I have been working with:
>
> commons/compress/compressors/bzip2/BZip2CompressorInputStream.html
>
> and I consistently notice that it just stops processing without an
> error of any kind.
>
> I checked the file at the offset where it stops and I also checked
> the file with the Linux bzip2 utility and nothing seems to be wrong in
> any way.
> The source file I used is:
>
> enwiki-20141008-pages-articles.xml.bz2
>
> which you can get from:
>
> http://torrentz.pl/search?f=articles%20enwiki&safe=0
>
> I am using exactly the code example you had on your user guide:
>
> commons-compress/commons-compress_User Guide.html
>
>
> aBZ2IFl = IFl.getCanonicalPath();
>
> File OFl = new File(aOFlNm);
> aOFlNm = OFl.getCanonicalPath();
> // __
> InputStream NwIS = Files.newInputStream(Paths.get(aBZ2IFl));
> BufferedInputStream BIS = new BufferedInputStream(NwIS);
> BZip2CompressorInputStream bz2IS = new BZip2CompressorInputStream(BIS);
>
> OutputStream NwOS = Files.newOutputStream(Paths.get(aOFlNm));
> int n = 0;
> while (-1 != (n = bz2IS.read(bArBfr))) { NwOS.write(bArBfr, 0, n);
> lTtlByts += n; }
> NwOS.close();
> bz2IS.close();
>
> but it stops abruptly:
>
> // __ aOFlNm:
> |enwiki-20141008-pages-articles-multistream_20201012174009.440.xml|
> // __ |2601| total bytes compressed into |12081280894| processed in
> |2586| (ms), |1| (bytes/ms)
>
> real 0m2.955s
> user 0m2.996s
> sys 0m0.176s
>
> ~
> _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> ls -l "${_OFL}"
> wc -l "${_OFL}"
>
> $ _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> $ ls -l "${_OFL}"
> -r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ wc -l "${_OFL}"
> 41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ md5sum --text "${_OFL}"
> 75c87a6650433b5cea4fef0bdae1cc1f
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ sha1sum --text "${_OFL}"
> 2799934309372685af919c17798e78c1796637ef
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ file --brief "${_OFL}"
> ASCII text
>
> $
>
> // __ originally downloaded file checked and decompressed using Linux
> bzip2 Version 1.0.6, 6-Sept-2010:
>
> $ which bzip2
> /bin/bzip2
>
> _BZ2="bzip2_--version.txt"
> bzip2 --version > "${_BZ2}" 2>&1
> cat "${_BZ2}" | head -n 1
> rm -f "${_BZ2}"
>
> $ _BZ2="bzip2_--version.txt"
> $ bzip2 --version > "${_BZ2}" 2>&1
> $ cat "${_BZ2}" | head -n 1
> bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
> $ rm -f "${_BZ2}"
>
> // __ "testing" bz2 file
>
> $ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
>
> $ time bzip2 --test --verbose "${_IFL}"
> enwiki-20141008-pages-articles-multistream.xml.bz2: ok
>
> real 93m51.202s
> user 92m31.600s
> sys 0m35.188s
>
> // __ decompressing bz2 file
>
> $ time bzip2 --decompress --verbose --keep "${_IFL}"
> enwiki-20141008-pages-articles-multistream.xml.bz2: done
>
> real 129m39.665s
> user 108m15.368s
> sys 7m18.684s
> $
>
> // __ decompressed file
>
> _IFL="enwiki-20141008-pages-articles-multistream.xml"
> ls -l "${_IFL}"
> time wc -l "${_IFL}"
> time md5sum --text "${_IFL}"
> time sha1sum --text "${_IFL}"
> file --brief "${_IFL}"
>
> $ _IFL="enwiki-20141008-pages-articles-multistream.xml"
>
> $ ls -l "${_IFL}"
> -r--r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 22 2014
> enwiki-20141008-pages-articles-multistream.xml
>
> $ time wc -l "${_IFL}"
> 800855553 enwiki-20141008-pages-articles-multistream.xml
>
> real 26m13.664s
> user 1m3.308s
> sys 1m30.616s
>
> $ time md5sum --text "${_IFL}"
> 1cfabd688427728794e7ae75dc93e84c
> enwiki-20141008-pages-articles-multistream.xml
>
> real 27m39.208s
> user 4m14.884s
> sys 1m33.788s
>
> $ time sha1sum --text "${_IFL}"
> e337572c1957a5a4d7625e3180e16f20e77749b1
> enwiki-20141008-pages-articles-multistream.xml
>
> real 30m40.383s
> user 8m39.852s
> sys 1m32.864s
>
> $ file --brief "${_IFL}"
> HTML document, UTF-8 Unicode text, with very long lines
> $
>
> // __ file decompressed using common compress bz2 (decompressing worked
> fine!)
>
> _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
> ls -l "${_IFL}"
> time wc -l "${_IFL}"
> time md5sum --text "${_IFL}"
> time sha1sum --text "${_IFL}"
> file --brief "${_IFL}"
>
> $ _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
>
> $ ls -l "${_IFL}"
> -rw-r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 13 03:35
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> $ time wc -l "${_IFL}"
> 800855553 enwiki-latest-pages-articles_20201013002000.103.xml
>
> real 14m44.535s
> user 3m55.816s
> sys 1m22.816s
>
> $ time md5sum --text "${_IFL}"
> 1cfabd688427728794e7ae75dc93e84c
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> real 16m14.680s
> user 3m19.256s
> sys 1m30.488s
>
> $ time sha1sum --text "${_IFL}"
> e337572c1957a5a4d7625e3180e16f20e77749b1
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> real 17m45.103s
> user 7m29.988s
> sys 1m29.540s
>
> $ file --brief "${_IFL}"
> HTML document, UTF-8 Unicode text, with very long lines
>
> $
>
>
> // __ file decompressed using common compress bz2 (decompressing
> somehow abruptly stopped)
>
> _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> $ ls -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> -r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ wc -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> 41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ cat "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
> http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
> xml:lang="en">
> <siteinfo>
> <sitename>Wikipedia</sitename>
> <dbname>enwiki</dbname>
> <base>http://en.wikipedia.org/wiki/Main_Page</base>
> <generator>MediaWiki 1.25wmf1</generator>
> <case>first-letter</case>
> <namespaces>
> <namespace key="-2" case="first-letter">Media</namespace>
> <namespace key="-1" case="first-letter">Special</namespace>
> <namespace key="0" case="first-letter" />
> <namespace key="1" case="first-letter">Talk</namespace>
> <namespace key="2" case="first-letter">User</namespace>
> <namespace key="3" case="first-letter">User talk</namespace>
> <namespace key="4" case="first-letter">Wikipedia</namespace>
> <namespace key="5" case="first-letter">Wikipedia talk</namespace>
> <namespace key="6" case="first-letter">File</namespace>
> <namespace key="7" case="first-letter">File talk</namespace>
> <namespace key="8" case="first-letter">MediaWiki</namespace>
> <namespace key="9" case="first-letter">MediaWiki talk</namespace>
> <namespace key="10" case="first-letter">Template</namespace>
> <namespace key="11" case="first-letter">Template talk</namespace>
> <namespace key="12" case="first-letter">Help</namespace>
> <namespace key="13" case="first-letter">Help talk</namespace>
> <namespace key="14" case="first-letter">Category</namespace>
> <namespace key="15" case="first-letter">Category talk</namespace>
> <namespace key="100" case="first-letter">Portal</namespace>
> <namespace key="101" case="first-letter">Portal talk</namespace>
> <namespace key="108" case="first-letter">Book</namespace>
> <namespace key="109" case="first-letter">Book talk</namespace>
> <namespace key="118" case="first-letter">Draft</namespace>
> <namespace key="119" case="first-letter">Draft talk</namespace>
> <namespace key="446" case="first-letter">Education
> Program</namespace>
> <namespace key="447" case="first-letter">Education Program
> talk</namespace>
> <namespace key="710" case="first-letter">TimedText</namespace>
> <namespace key="711" case="first-letter">TimedText talk</namespace>
> <namespace key="828" case="first-letter">Module</namespace>
> <namespace key="829" case="first-letter">Module talk</namespace>
> <namespace key="2600" case="first-letter">Topic</namespace>
> </namespaces>
> </siteinfo>
> $
>
> // __ first 45 lines of decompressed file using Linux bzip2
>
> _IFL="enwiki-20141008-pages-articles-multistream.xml"
>
> head -n 45 "${_IFL}"
>
> $ head -n 45 "${_IFL}"
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
> http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
> xml:lang="en">
> <siteinfo>
> <sitename>Wikipedia</sitename>
> <dbname>enwiki</dbname>
> <base>http://en.wikipedia.org/wiki/Main_Page</base>
> <generator>MediaWiki 1.25wmf1</generator>
> <case>first-letter</case>
> <namespaces>
> <namespace key="-2" case="first-letter">Media</namespace>
> <namespace key="-1" case="first-letter">Special</namespace>
> <namespace key="0" case="first-letter" />
> <namespace key="1" case="first-letter">Talk</namespace>
> <namespace key="2" case="first-letter">User</namespace>
> <namespace key="3" case="first-letter">User talk</namespace>
> <namespace key="4" case="first-letter">Wikipedia</namespace>
> <namespace key="5" case="first-letter">Wikipedia talk</namespace>
> <namespace key="6" case="first-letter">File</namespace>
> <namespace key="7" case="first-letter">File talk</namespace>
> <namespace key="8" case="first-letter">MediaWiki</namespace>
> <namespace key="9" case="first-letter">MediaWiki talk</namespace>
> <namespace key="10" case="first-letter">Template</namespace>
> <namespace key="11" case="first-letter">Template talk</namespace>
> <namespace key="12" case="first-letter">Help</namespace>
> <namespace key="13" case="first-letter">Help talk</namespace>
> <namespace key="14" case="first-letter">Category</namespace>
> <namespace key="15" case="first-letter">Category talk</namespace>
> <namespace key="100" case="first-letter">Portal</namespace>
> <namespace key="101" case="first-letter">Portal talk</namespace>
> <namespace key="108" case="first-letter">Book</namespace>
> <namespace key="109" case="first-letter">Book talk</namespace>
> <namespace key="118" case="first-letter">Draft</namespace>
> <namespace key="119" case="first-letter">Draft talk</namespace>
> <namespace key="446" case="first-letter">Education
> Program</namespace>
> <namespace key="447" case="first-letter">Education Program
> talk</namespace>
> <namespace key="710" case="first-letter">TimedText</namespace>
> <namespace key="711" case="first-letter">TimedText talk</namespace>
> <namespace key="828" case="first-letter">Module</namespace>
> <namespace key="829" case="first-letter">Module talk</namespace>
> <namespace key="2600" case="first-letter">Topic</namespace>
> </namespaces>
> </siteinfo>
> <page>
> <title>AccessibleComputing</title>
> <ns>0</ns>
> <id>10</id>
> $
>
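
A note on the 2601-byte result quoted above: the truncated output ends exactly at </siteinfo>, and the input file name contains "multistream". One possible explanation, which is only an assumption and is not confirmed anywhere in this thread, is that the .bz2 file is a concatenation of several independent bzip2 streams and that reading stopped at the end of the first one. The sketch below is a rough, stdlib-only way to check that assumption by counting byte-aligned stream headers ("BZh1" to "BZh9" followed by the first-block magic 0x314159265359). The class Bz2StreamCounter and its methods are hypothetical names made up for this illustration, and the scan is a heuristic: it ignores empty streams and could in principle be fooled by the same ten bytes occurring inside compressed data.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Commons Compress.
final class Bz2StreamCounter {
    // Magic number of the first compressed block, which follows the
    // 4-byte stream header "BZh" + level digit.
    private static final int[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
    private static final int WINDOW = 4 + BLOCK_MAGIC.length;

    // Counts byte-aligned occurrences of "BZh[1-9]" + block magic.
    static long countStreams(InputStream in) throws IOException {
        int[] w = new int[WINDOW]; // sliding window, oldest byte first
        long count = 0;
        int b;
        while ((b = in.read()) != -1) {
            System.arraycopy(w, 1, w, 0, WINDOW - 1); // shift left by one
            w[WINDOW - 1] = b;
            if (isStreamStart(w)) {
                count++;
            }
        }
        return count;
    }

    // Convenience overload for in-memory data.
    static long countStreams(byte[] data) {
        try {
            return countStreams(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean isStreamStart(int[] w) {
        if (w[0] != 'B' || w[1] != 'Z' || w[2] != 'h' || w[3] < '1' || w[3] > '9') {
            return false;
        }
        for (int i = 0; i < BLOCK_MAGIC.length; i++) {
            if (w[4 + i] != BLOCK_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Bz2StreamCounter <file.bz2>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println("stream headers found: " + countStreams(in));
        }
    }
}
```

If the count comes out greater than one, it may be worth trying the two-argument constructor BZip2CompressorInputStream(InputStream in, boolean decompressConcatenated) with true as the second argument, which asks Commons Compress to keep decompressing across stream boundaries instead of stopping after the first stream.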