[ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676812#comment-16676812
 ] 

floyd commented on TIKA-2750:
-----------------------------

I ran another test on the regression VM over the last 4 days, to see how long 
it would take to narrow down /data1/docs/commoncrawl3/ with 6 worker threads 
(using nearly 100% CPU on the regression VM):


{code:java}
$ /data1/fuzzing/tools/afl-kit/afl-cmin.py --no-dedup -i 
'/data1/docs/commoncrawl3/*/' -o /data1/fuzzing/tika-corpus-data1-docs-cmined/ 
-m none -t 30000 -w 6 /data1/fuzzing/tools/jqf-zip/bin/jqf-afl-target 
edu.berkeley.cs.jqf.examples.tika.TikaParserTest fuzz @@
Hint: install python module "tqdm" to show progress bar
2018-11-02 17:06:34,070 - INFO - Found 819070 input files in 1024 directories
2018-11-02 17:06:34,071 - INFO - Skipping file deduplication.
2018-11-02 17:06:34,071 - INFO - Sorting files.
2018-11-02 17:06:44,474 - INFO - Testing the target binary
2018-11-02 17:06:52,606 - INFO - ok, 2729 tuples recorded
2018-11-02 17:06:52,689 - INFO - Obtaining trace results{code}

However, afl-cmin.py does not seem able to create traces faster than about 300 
files per hour. After around 4 days I was still nowhere near done:


{code:java}
$ ls -1 ./tika-corpus-data1-docs-cmined/.traces/ |wc -l
28505{code}

As that commoncrawl folder had 819'070 files, tracing alone would take over 4 
months, and only then would the sorting and best-candidate selection even 
start. And if the data is then too big for one of those operations (e.g. not 
enough RAM or disk), it probably all fails and the run was useless.
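A rough back-of-the-envelope sketch of that estimate, using only the counts from the logs above (819070 inputs, 28505 traces after roughly 4 days):

```python
# Numbers taken from the afl-cmin.py log and trace count above.
total_files = 819070
traced = 28505
elapsed_hours = 4 * 24

rate = traced / elapsed_hours                      # ~297 files/hour, matching the observed ~300/hour
remaining_hours = (total_files - traced) / rate
remaining_days = remaining_hours / 24              # roughly 110+ days of tracing still ahead
print(f"{rate:.0f} files/hour, ~{remaining_days:.0f} days of tracing left")
```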

So maybe it would be better to do some manual cleanup first (e.g. remove 
ASCII-only files) and then do several runs on smaller parts of the entire 
corpus.
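As a sketch of what such a cleanup pass could look like: the is_ascii_only() helper below is purely hypothetical (not part of afl-cmin.py or any existing tool) and just checks whether the first few KB of a file contain only printable ASCII, which would be enough to skip plain-text inputs before minimization.

```python
def is_ascii_only(path, sample_size=4096):
    """Heuristic: True if the first sample_size bytes are printable ASCII
    plus common whitespace. Such files could be dropped before cmin."""
    with open(path, "rb") as f:
        chunk = f.read(sample_size)
    # Anything outside tab/newline/CR or 0x20..0x7e marks the file as
    # binary (and therefore worth keeping in the fuzzing corpus).
    return all(b in b"\t\n\r" or 0x20 <= b <= 0x7e for b in chunk)
```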

> Update regression corpus
> ------------------------
>
>                 Key: TIKA-2750
>                 URL: https://issues.apache.org/jira/browse/TIKA-2750
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: CC-MAIN-2018-39-charset_lang_by_tld.zip, 
> CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 
> CC-MAIN-2018-39-mimes-v-detected.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to re-fresh some data from common crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. down sample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more ooxml)
> 3) identify and re-fetch truncated documents from the original site -- 
> CommonCrawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, esp. for PDFs.  In short, we should keep some truncated documents, but 
> I'd also like to get more complete docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
