$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"

$ time bzip2 --test --verbose "${_IFL}"
  enwiki-20141008-pages-articles-multistream.xml.bz2: ok

real    93m51.202s
user    92m31.600s
sys     0m35.188s

$ time bzip2 --decompress --verbose --keep "${_IFL}"
  enwiki-20141008-pages-articles-multistream.xml.bz2: done

real    129m39.665s
user    108m15.368s
sys     7m18.684s

$ _IFL="enwiki-latest-pages-articles.xml.bz2"

$ time bzip2 --test --verbose "${_IFL}"
  enwiki-latest-pages-articles.xml.bz2: ok

real    143m31.399s
user    128m4.964s
sys     1m9.108s

$ time bzip2 --decompress --verbose --keep "${_IFL}"
  enwiki-latest-pages-articles.xml.bz2: done

real    147m59.737s
user    124m31.476s
sys     8m1.516s
$


On 10/13/20, Albretch Mueller <lbrt...@gmail.com> wrote:
>  As part of my corpus research I have to work with very large text
> files. Wikipedia dumps are bzip2-compressed, so I have been working with:
>
>  commons/compress/compressors/bzip2/BZip2CompressorInputStream.html
>
>  and I consistently notice that it just stops processing without an
> error of any kind.
>
>  I checked the file at the offset where it stops and I also checked
> the file with the Linux bzip2 utility and nothing seems to be wrong in
> any way. The source file I used is:
>
>  enwiki-20141008-pages-articles.xml.bz2
>
>  which you can get from:
>
>  http://torrentz.pl/search?f=articles%20enwiki&safe=0
>
>  I am using exactly the code example from your user guide:
>
>  commons-compress/commons-compress_User Guide.html
>
>
>     aBZ2IFl = IFl.getCanonicalPath();
>
>     File OFl = new File(aOFlNm);
>     aOFlNm = OFl.getCanonicalPath();
> // __
>     InputStream NwIS = Files.newInputStream(Paths.get(aBZ2IFl));
>     BufferedInputStream BIS = new BufferedInputStream(NwIS);
>     BZip2CompressorInputStream bz2IS = new BZip2CompressorInputStream(BIS);
>
>     OutputStream NwOS = Files.newOutputStream(Paths.get(aOFlNm));
>     int n = 0;
>     while (-1 != (n = bz2IS.read(bArBfr))) {
>       NwOS.write(bArBfr, 0, n);
>       lTtlByts += n;
>     }
>     NwOS.close();
>     bz2IS.close();
>
>  but it stops abruptly:
>
> // __ aOFlNm:
> |enwiki-20141008-pages-articles-multistream_20201012174009.440.xml|
> // __ |2601| total bytes compressed into |12081280894| processed in
> |2586| (ms), |1| (bytes/ms)
>
> real  0m2.955s
> user  0m2.996s
> sys   0m0.176s
>
> ~
> _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> ls -l "${_OFL}"
> wc -l "${_OFL}"
>
> $ _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> $ ls -l "${_OFL}"
> -r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ wc -l "${_OFL}"
> 41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ md5sum --text "${_OFL}"
> 75c87a6650433b5cea4fef0bdae1cc1f
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ sha1sum --text "${_OFL}"
> 2799934309372685af919c17798e78c1796637ef
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ file --brief  "${_OFL}"
> ASCII text
>
> $
>
> // __ originally downloaded file checked and decompressed using Linux
> bzip2 Version 1.0.6, 6-Sept-2010:
>
> $ which bzip2
> /bin/bzip2
>
> _BZ2="bzip2_--version.txt"
> bzip2 --version > "${_BZ2}" 2>&1
> cat "${_BZ2}" | head -n 1
> rm -f "${_BZ2}"
>
> $ _BZ2="bzip2_--version.txt"
> $ bzip2 --version > "${_BZ2}" 2>&1
> $ cat "${_BZ2}" | head -n 1
> bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.
> $ rm -f "${_BZ2}"
>
> // __ "testing" bz2 file
>
> $ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
>
> $ time bzip2 --test --verbose "${_IFL}"
>   enwiki-20141008-pages-articles-multistream.xml.bz2: ok
>
> real    93m51.202s
> user    92m31.600s
> sys     0m35.188s
>
> // __ decompressing bz2 file
>
> $ time bzip2 --decompress --verbose --keep "${_IFL}"
>   enwiki-20141008-pages-articles-multistream.xml.bz2: done
>
> real    129m39.665s
> user    108m15.368s
> sys     7m18.684s
> $
>
> // __ decompressed file
>
> _IFL="enwiki-20141008-pages-articles-multistream.xml"
> ls -l "${_IFL}"
> time wc -l "${_IFL}"
> time md5sum --text "${_IFL}"
> time sha1sum --text "${_IFL}"
> file --brief  "${_IFL}"
>
> $ _IFL="enwiki-20141008-pages-articles-multistream.xml"
>
> $ ls -l "${_IFL}"
> -r--r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 22  2014
> enwiki-20141008-pages-articles-multistream.xml
>
> $ time wc -l "${_IFL}"
> 800855553 enwiki-20141008-pages-articles-multistream.xml
>
> real    26m13.664s
> user    1m3.308s
> sys     1m30.616s
>
> $ time md5sum --text "${_IFL}"
> 1cfabd688427728794e7ae75dc93e84c
> enwiki-20141008-pages-articles-multistream.xml
>
> real    27m39.208s
> user    4m14.884s
> sys     1m33.788s
>
> $ time sha1sum --text "${_IFL}"
> e337572c1957a5a4d7625e3180e16f20e77749b1
> enwiki-20141008-pages-articles-multistream.xml
>
> real    30m40.383s
> user    8m39.852s
> sys     1m32.864s
>
> $ file --brief  "${_IFL}"
> HTML document, UTF-8 Unicode text, with very long lines
> $
>
> // __ file decompressed using Commons Compress bz2 (decompressing worked
> fine!)
>
> _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
> ls -l "${_IFL}"
> time wc -l "${_IFL}"
> time md5sum --text "${_IFL}"
> time sha1sum --text "${_IFL}"
> file --brief  "${_IFL}"
>
> $ _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
>
> $ ls -l "${_IFL}"
> -rw-r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 13 03:35
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> $ time wc -l "${_IFL}"
> 800855553 enwiki-latest-pages-articles_20201013002000.103.xml
>
> real    14m44.535s
> user    3m55.816s
> sys     1m22.816s
>
> $ time md5sum --text "${_IFL}"
> 1cfabd688427728794e7ae75dc93e84c
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> real    16m14.680s
> user    3m19.256s
> sys     1m30.488s
>
> $ time sha1sum --text "${_IFL}"
> e337572c1957a5a4d7625e3180e16f20e77749b1
> enwiki-latest-pages-articles_20201013002000.103.xml
>
> real    17m45.103s
> user    7m29.988s
> sys     1m29.540s
>
> $ file --brief  "${_IFL}"
> HTML document, UTF-8 Unicode text, with very long lines
>
> $
>
>
> // __ file decompressed using Commons Compress bz2 (decompressing
> somehow abruptly stopped)
>
> _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
>
> $ ls -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> -r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
> enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ wc -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> 41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
>
> $ cat "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
> http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
> xml:lang="en">
>   <siteinfo>
>     <sitename>Wikipedia</sitename>
>     <dbname>enwiki</dbname>
>     <base>http://en.wikipedia.org/wiki/Main_Page</base>
>     <generator>MediaWiki 1.25wmf1</generator>
>     <case>first-letter</case>
>     <namespaces>
>       <namespace key="-2" case="first-letter">Media</namespace>
>       <namespace key="-1" case="first-letter">Special</namespace>
>       <namespace key="0" case="first-letter" />
>       <namespace key="1" case="first-letter">Talk</namespace>
>       <namespace key="2" case="first-letter">User</namespace>
>       <namespace key="3" case="first-letter">User talk</namespace>
>       <namespace key="4" case="first-letter">Wikipedia</namespace>
>       <namespace key="5" case="first-letter">Wikipedia talk</namespace>
>       <namespace key="6" case="first-letter">File</namespace>
>       <namespace key="7" case="first-letter">File talk</namespace>
>       <namespace key="8" case="first-letter">MediaWiki</namespace>
>       <namespace key="9" case="first-letter">MediaWiki talk</namespace>
>       <namespace key="10" case="first-letter">Template</namespace>
>       <namespace key="11" case="first-letter">Template talk</namespace>
>       <namespace key="12" case="first-letter">Help</namespace>
>       <namespace key="13" case="first-letter">Help talk</namespace>
>       <namespace key="14" case="first-letter">Category</namespace>
>       <namespace key="15" case="first-letter">Category talk</namespace>
>       <namespace key="100" case="first-letter">Portal</namespace>
>       <namespace key="101" case="first-letter">Portal talk</namespace>
>       <namespace key="108" case="first-letter">Book</namespace>
>       <namespace key="109" case="first-letter">Book talk</namespace>
>       <namespace key="118" case="first-letter">Draft</namespace>
>       <namespace key="119" case="first-letter">Draft talk</namespace>
>       <namespace key="446" case="first-letter">Education
> Program</namespace>
>       <namespace key="447" case="first-letter">Education Program
> talk</namespace>
>       <namespace key="710" case="first-letter">TimedText</namespace>
>       <namespace key="711" case="first-letter">TimedText talk</namespace>
>       <namespace key="828" case="first-letter">Module</namespace>
>       <namespace key="829" case="first-letter">Module talk</namespace>
>       <namespace key="2600" case="first-letter">Topic</namespace>
>     </namespaces>
>   </siteinfo>
> $
>
> // __ first 45 lines of decompressed file using Linux bzip2
>
> _IFL="enwiki-20141008-pages-articles-multistream.xml"
>
> head -n 45 "${_IFL}"
>
> $ head -n 45 "${_IFL}"
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
> http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
> xml:lang="en">
>   <siteinfo>
>     <sitename>Wikipedia</sitename>
>     <dbname>enwiki</dbname>
>     <base>http://en.wikipedia.org/wiki/Main_Page</base>
>     <generator>MediaWiki 1.25wmf1</generator>
>     <case>first-letter</case>
>     <namespaces>
>       <namespace key="-2" case="first-letter">Media</namespace>
>       <namespace key="-1" case="first-letter">Special</namespace>
>       <namespace key="0" case="first-letter" />
>       <namespace key="1" case="first-letter">Talk</namespace>
>       <namespace key="2" case="first-letter">User</namespace>
>       <namespace key="3" case="first-letter">User talk</namespace>
>       <namespace key="4" case="first-letter">Wikipedia</namespace>
>       <namespace key="5" case="first-letter">Wikipedia talk</namespace>
>       <namespace key="6" case="first-letter">File</namespace>
>       <namespace key="7" case="first-letter">File talk</namespace>
>       <namespace key="8" case="first-letter">MediaWiki</namespace>
>       <namespace key="9" case="first-letter">MediaWiki talk</namespace>
>       <namespace key="10" case="first-letter">Template</namespace>
>       <namespace key="11" case="first-letter">Template talk</namespace>
>       <namespace key="12" case="first-letter">Help</namespace>
>       <namespace key="13" case="first-letter">Help talk</namespace>
>       <namespace key="14" case="first-letter">Category</namespace>
>       <namespace key="15" case="first-letter">Category talk</namespace>
>       <namespace key="100" case="first-letter">Portal</namespace>
>       <namespace key="101" case="first-letter">Portal talk</namespace>
>       <namespace key="108" case="first-letter">Book</namespace>
>       <namespace key="109" case="first-letter">Book talk</namespace>
>       <namespace key="118" case="first-letter">Draft</namespace>
>       <namespace key="119" case="first-letter">Draft talk</namespace>
>       <namespace key="446" case="first-letter">Education
> Program</namespace>
>       <namespace key="447" case="first-letter">Education Program
> talk</namespace>
>       <namespace key="710" case="first-letter">TimedText</namespace>
>       <namespace key="711" case="first-letter">TimedText talk</namespace>
>       <namespace key="828" case="first-letter">Module</namespace>
>       <namespace key="829" case="first-letter">Module talk</namespace>
>       <namespace key="2600" case="first-letter">Topic</namespace>
>     </namespaces>
>   </siteinfo>
>   <page>
>     <title>AccessibleComputing</title>
>     <ns>0</ns>
>     <id>10</id>
> $
>
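
A possible explanation, offered as a hedged note rather than a confirmed
diagnosis: a "multistream" dump is many bzip2 streams concatenated back to
back, and the 2601 bytes written above are exactly the siteinfo header, which
would be the whole first stream. BZip2CompressorInputStream stops after the
first stream unless it is created with the two-argument constructor, i.e.
new BZip2CompressorInputStream(BIS, true), where the second argument is
decompressConcatenated. The sketch below uses only the JDK (Java 11+ for
readNBytes) to test that assumption by counting byte-aligned stream headers in
the first MiB of the file. The class name Bz2StreamCounter and the 1 MiB prefix
size are my own illustrative choices, not part of Commons Compress.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Bz2StreamCounter {
    // Block-header magic that follows the 4-byte "BZh<level>" stream header.
    private static final byte[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
    // End-of-stream magic; it directly follows the header of an empty stream.
    private static final byte[] EOS_MAGIC = {0x17, 0x72, 0x45, 0x38, 0x50, (byte) 0x90};

    private static boolean matches(byte[] data, int offset, byte[] magic) {
        for (int j = 0; j < magic.length; j++) {
            if (data[offset + j] != magic[j]) {
                return false;
            }
        }
        return true;
    }

    // Counts byte-aligned positions that look like the start of a bzip2 stream:
    // 'B' 'Z' 'h', a level digit '1'..'9', then one of the two 6-byte magics.
    static int countStreamHeaders(byte[] data) {
        int count = 0;
        for (int i = 0; i + 10 <= data.length; i++) {
            boolean header = data[i] == 'B' && data[i + 1] == 'Z' && data[i + 2] == 'h'
                    && data[i + 3] >= '1' && data[i + 3] <= '9';
            if (header && (matches(data, i + 4, BLOCK_MAGIC) || matches(data, i + 4, EOS_MAGIC))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Bz2StreamCounter <file.bz2>");
            return;
        }
        byte[] prefix;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Only a prefix is read; the first stream of a multistream dump is tiny.
            prefix = in.readNBytes(1 << 20);
        }
        System.out.println(countStreamHeaders(prefix)
                + " stream header(s) found in the first " + prefix.length + " bytes");
    }
}
```

If the assumption holds, I would expect a count above 1 for the multistream
file and a count of 1 for enwiki-latest-pages-articles.xml.bz2, which would be
consistent with one file stopping after the siteinfo header and the other
decompressing completely.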

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
