$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: ok
real 93m51.202s
user 92m31.600s
sys 0m35.188s
$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: done
real 129m39.665s
user 108m15.368s
sys 7m18.684s
$ _IFL="enwiki-latest-pages-articles.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-latest-pages-articles.xml.bz2: ok
real 143m31.399s
user 128m4.964s
sys 1m9.108s
$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-latest-pages-articles.xml.bz2: done
real 147m59.737s
user 124m31.476s
sys 8m1.516s
$
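
A possible explanation, offered as a guess rather than a diagnosis: the
"multistream" dumps are, as far as I know, many bzip2 streams concatenated
into one file, and the first stream holds only the <siteinfo> header.
BZip2CompressorInputStream has a two-argument constructor,
BZip2CompressorInputStream(InputStream, boolean decompressConcatenated).
With the one-argument form used in the quoted code it stops after the first
stream, and read() returns -1 without any error. That would match the 2601
bytes and 41 lines reported below. The sketch below is the same copy loop
written with try-with-resources. StreamCopy and copy are names made up for
illustration, and main uses an in-memory stream so that the code compiles
without Commons Compress; the comment in main shows where the decoder would
be plugged in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons Compress.
final class StreamCopy {

    private StreamCopy() {}

    /**
     * Drains "in" into "out" until read() reports end of stream (-1)
     * and returns the number of bytes written. Both streams are closed,
     * even if an exception is thrown part-way through.
     */
    static long copy(InputStream in, OutputStream out) throws IOException {
        long total = 0;
        byte[] buffer = new byte[64 * 1024];
        try (InputStream src = in; OutputStream dst = out) {
            int n;
            while ((n = src.read(buffer)) != -1) {
                dst.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the decoder. With Commons Compress on the
        // classpath, the source for a multistream file would instead be
        //   new BZip2CompressorInputStream(bufferedFileStream, true)
        // where "true" is the decompressConcatenated flag.
        byte[] data = "<mediawiki>\n</mediawiki>\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long written = copy(new ByteArrayInputStream(data), sink);
        System.out.println(written + " bytes written");
    }
}
```

If the flag is indeed the cause, the only change needed in the quoted code
would be the second constructor argument; the loop itself can stay as it is.
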
On 10/13/20, Albretch Mueller wrote:
> As part of my corpus research I have to work with very large
> text files. Wikipedia dumps are bzip2-compressed, so I have been
> working with:
>
> commons/compress/compressors/bzip2/BZip2CompressorInputStream.html
>
> and I consistently notice that it just stops processing without an
> error of any kind.
>
> I checked the file at the offset where it stops and I also checked
> the file with the Linux bzip2 utility, and nothing seems to be wrong
> in any way. The source file I used is:
>
> enwiki-20141008-pages-articles.xml.bz2
>
> which you can get from:
>
> http://torrentz.pl/search?f=articles%20enwiki&safe=0
>
> I am using exactly the code example from your user guide:
>
> commons-compress/commons-compress_User Guide.html
>
>
> aBZ2IFl = IFl.getCanonicalPath();
>
> File OFl = new File(aOFlNm);
> aOFlNm = OFl.getCanonicalPath();
> // __
> InputStream NwIS = Files.newInputStream(Paths.get(aBZ2IFl));
> BufferedInputStream BIS = new BufferedInputStream(NwIS);
> BZip2CompressorInputStream bz2IS = new BZip2CompressorInputStream(BIS);
>
> OutputStream NwOS = Files.newOutputStream(Paths.get(aOFlNm));
> int n = 0;
> while (-1 != (n = bz2IS.read(bArBfr))) {
>   NwOS.write(bArBfr, 0, n);
>   lTtlByts += n;
> }
> NwOS.close();
> bz2IS.close();
>
> but it stops abruptly:
>
> // __ aOFlNm:
> |enwiki-20141008-pages-articles-multistream_20201012174009.440.xml|
> // __ |2601| total bytes compressed into |12081280894| processed in
> |2586| (ms), |1| (bytes/ms)
>
> real 0m2.955s
> user 0m2.996s
> sys 0m0.176s
>
> ~
> _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> ls -l "${_OFL}"
> wc -l "${_OFL}"
>
> $ _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> $ ls -l "${_OFL}"
> -r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ wc -l "${_OFL}"
> 41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ md5sum --text "${_OFL}"
> 75c87a6650433b5cea4fef0bdae1cc1f
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ sha1sum --text "${_OFL}"
> 2799934309372685af919c17798e78c1796637ef
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ file --brief "${_OFL}"
> ASCII text
>
> $
>
> // __ originally downloaded file checked and decompressed using Linux
> bzip2 Version 1.0.6, 6-Sept-2010:
>
> $ which bzip2
> /bin/bzip2
>
> _BZ2="bzip2_--version.txt"
> bzip2 --version > "${_BZ2}" 2>&1
> cat "${_BZ2}" | head -n 1
> rm -f "${_BZ2}"
>
> $ _BZ2="bzip2_--version.txt"
> $ bzip2 --version > "${_BZ2}" 2>&1
> $ cat "${_BZ2}" | head -n 1
> bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
> $ rm -f "${_BZ2}"
>
> // __ "testing" bz2 file
>
> $ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
>
> $ time bzip2 --test --verbose "${_IFL}"
> enwiki-20141008-pages-articles-multistream.xml.bz2: ok
>
> real 93m51.202s
> user 92m31.600s
> sys 0m35.188s
>
> // __ decompressing bz2 file
>
> $ time bzip2 --decompress --verbose --keep "${_IFL}"
> enwiki-20141008-pages-articles-multistream.xml.bz2: done
>
> real 129m39.665s
> user 108m15.368s
> sys 7m18.684s
> $
>
> // __ decompressed file
>
> _IFL="enwiki-20141008-pages-articles-multistream.xml"
> ls -l "${_IFL}"
> time wc -l "${_IFL}"
> time md5sum --text "${_IFL}"
> time sha1sum --text "${_IFL}"
> file --brief "${_IFL}"
>
> $ _IFL="enwiki-20141008-pages-articles-multistream.xml"
>
> $ ls -l "${_IFL}"
> -r--r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 22 2014
> enwiki-20141008-pages-articles-multistream.xml
>
> $ time wc -l "${_IFL}"
> 800855553 enwiki-20141008-pages-articles-multistream.xml
>
> real 26m13.664s
> user 1m3.308s
> sys 1m30.616s
>
> $ time md5sum --text "${_IFL}"
> 1cfabd688427728794e7ae75dc93e84c
> enwiki-20141008-pages-articles-multistream.xml
>
> real 27m39.208s
> user 4m14.884s
> sys 1m33.788s
>
> $ time sha1sum --text "${_IFL}"
> e337572c1957a5a4d7625e3180e16f20e77749b1
> enwiki-20141008-pages-articles-multistream.xml
>
> real 30m40.383s
> user 8m39.852s
> sys 1m32.864s
>
> $ file --brief "${_IFL}"
> HTML document, UTF-8 Unicode text, with very long lines
> $
>
> // __ file decompressed using Commons Compress bzip2 (decompressing
> worked fine!)
>
> _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
> ls -l "${_IFL}"
> time wc -l "${_IFL}"
> time md5sum --text "${_IFL}"
> time sha1sum --text "${_IFL}"
> file --brief "${_IFL}"
>
> $ _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
>
> $ ls -l "${_IFL}"
> -rw-r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 13 03:35
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> $ time wc -l "${_IFL}"
> 800855553 enwiki-latest-pages-articles_20201013002000.103.xml
>
> real 14m44.535s
> user 3m55.816s
> sys 1m22.816s
>
> $ time md5sum --text "${_IFL}"
> 1cfabd688427728794e7ae75dc93e84c
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> real 16m14.680s
> user 3m19.256s
> sys 1m30.488s
>
> $ time sha1sum --text "${_IFL}"
> e337572c1957a5a4d7625e3180e16f20e77749b1
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> real 17m45.103s
> user 7m29.988s
> sys 1m29.540s
>
> $ file --brief "${_IFL}"
> HTML document, UTF-8 Unicode text, with very long lines
>
> $
>
>
> // __ file decompressed using Commons Compress bzip2 (decompressing
> somehow stopped abruptly)
>
> _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> $ ls -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> -r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ wc -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> 41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ cat "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
> http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
> xml:lang="en">
> <siteinfo>
> <sitename>Wikipedia</sitename>
> <dbname>enwiki</dbname>
> <base>http://en.wikipedia.org/wiki/Main_Page</base>
> <generator>MediaWiki 1.25wmf1</generator>
> <case>first-letter</case>
> <namespaces>
> <namespace key="-2" case="first-letter">Media</namespace>
> <namespace key="-1" case="first-letter">Special</namespace>
> <namespace key="0" case="first-letter" />
> <namespace key="1" case="first-letter">Talk</namespace>
> <namespace key="2" case="first-letter">User</namespace>
> <namespace key="3" case="first-letter">User talk</namespace>
> <namespace key="4" case="first-letter">Wikipedia</namespace>
> <namespace key="5" case="first-letter">Wikipedia talk</namespace>
> <namespace key="6" case="first-letter">File</namespace>
> <namespace key="7" case="first-letter">File talk</namespace>
> <namespace key="8" case="first-letter">MediaWiki</namespace>
> <namespace key="9" case="first-letter">MediaWiki talk</namespace>
> <namespace key="10" case="first-letter">Template</namespace>
> <namespace key="11" case="first-letter">Template talk</namespace>
> <namespace key="12" case="first-letter">Help</namespace>
> <namespace key="13" case="first-letter">Help talk</namespace>
> <namespace key="14" case="first-letter">Category</namespace>
> <namespace key="15" case="first-letter">Category talk</namespace>
> <namespace key="100" case="first-letter">Portal</namespace>
> <namespace key="101" case="first-letter">Portal talk</namespace>
> <namespace key="108" case="first-letter">Book</namespace>
> <namespace key="109" case="first-letter">Book talk</namespace>
> <namespace key="118" case="first-letter">Draft</namespace>
> <namespace key="119" case="first-letter">Draft talk</namespace>
> <namespace key="446" case="first-letter">Education
> Program</namespace>
> <namespace key="447" case="first-letter">Education Program
> talk</namespace>
> <namespace key="710" case="first-letter">TimedText</namespace>
> <namespace key="711" case="first-letter">TimedText talk</namespace>
> <namespace key="828" case="first-letter">Module</namespace>
> <namespace key="829" case="first-letter">Module talk</namespace>
> <namespace key="2600" case="first-letter">Topic</namespace>
> </namespaces>
> </siteinfo>
> $
>
> // __ first 45 lines of decompressed file using Linux bzip2
>
> _IFL="enwiki-20141008-pages-articles-multistream.xml"
>
> head -n 45 "${_IFL}"
>
> $ head -n 45 "${_IFL}"
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
> http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
> xml:lang="en">
> <siteinfo>
> <sitename>Wikipedia</sitename>
> <dbname>enwiki</dbname>
> <base>http://en.wikipedia.org/wiki/Main_Page</base>
> <generator>MediaWiki 1.25wmf1</generator>
> <case>first-letter</case>
> <namespaces>
> <namespace key="-2" case="first-letter">Media</namespace>
> <namespace key="-1" case="first-letter">Special</namespace>
> <namespace key="0" case="first-letter" />
> <namespace key="1" case="first-letter">Talk</namespace>
> <namespace key="2" case="first-letter">User</namespace>
> <namespace key="3" case="first-letter">User talk</namespace>
> <namespace key="4" case="first-letter">Wikipedia</namespace>
> <namespace key="5" case="first-letter">Wikipedia talk</namespace>
> <namespace key="6" case="first-letter">File</namespace>
> <namespace key="7" case="first-letter">File talk</namespace>
> <namespace key="8" case="first-letter">MediaWiki</namespace>
> <namespace key="9" case="first-letter">MediaWiki talk</namespace>
> <namespace key="10" case="first-letter">Template</namespace>
> <namespace key="11" case="first-letter">Template talk</namespace>
> <namespace key="12" case="first-letter">Help</namespace>
> <namespace key="13" case="first-letter">Help talk</namespace>
> <namespace key="14" case="first-letter">Category</namespace>
> <namespace key="15" case="first-letter">Category talk</namespace>
> <namespace key="100" case="first-letter">Portal</namespace>
> <namespace key="101" case="first-letter">Portal talk</namespace>
> <namespace key="108" case="first-letter">Book</namespace>
> <namespace key="109" case="first-letter">Book talk</namespace>
> <namespace key="118" case="first-letter">Draft</namespace>
> <namespace key="119" case="first-letter">Draft talk</namespace>
> <namespace key="446" case="first-letter">Education
> Program</namespace>
> <namespace key="447" case="first-letter">Education Program
> talk</namespace>
> <namespace key="710" case="first-letter">TimedText</namespace>
> <namespace key="711" case="first-letter">TimedText talk</namespace>
> <namespace key="828" case="first-letter">Module</namespace>
> <namespace key="829" case="first-letter">Module talk</namespace>
> <namespace key="2600" case="first-letter">Topic</namespace>
> </namespaces>
> </siteinfo>
> <page>
> <title>AccessibleComputing</title>
> <ns>0</ns>
> <id>10</id>
> $
>
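
The md5sum and sha1sum runs quoted above each take a further pass of roughly
16 to 30 minutes over the 50 GB output. If that matters, a digest can also be
computed from Java with the JDK alone. The following is only a sketch:
StreamDigest and hexDigest are hypothetical names, and "MD5" and "SHA-1" are
the standard MessageDigest algorithm names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
final class StreamDigest {

    private StreamDigest() {}

    /** Reads "in" to the end, closes it, and returns the digest as lowercase hex. */
    static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[64 * 1024];
        try (InputStream src = in) {
            int n;
            while ((n = src.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            // Mask to an unsigned value so each byte prints as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: StreamDigest <file> [algorithm]");
            return;
        }
        String algorithm = args.length > 1 ? args[1] : "SHA-1";
        System.out.println(hexDigest(Files.newInputStream(Paths.get(args[0])), algorithm));
    }
}
```

The hex string should be directly comparable with the first field printed by
md5sum or sha1sum for the same file.
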