[ 
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974044#comment-14974044
 ] 

Tim Allison commented on PDFBOX-3058:
-------------------------------------

I should have included these other caveats in my original email:
#. Many of the Common Crawl docs are truncated at ~1 MB.  This is (was?) by 
design on Common Crawl's part, to limit the amount of data gathered.  
Somewhere on one of my todo lists is to re-pull the truncated docs so that we 
have the full docs for file types that are typically larger than 1 MB, 
including PDFs.
#. I haven't run any virus scans on the Common Crawl docs; that's also 
somewhere on a todo list.  Recommendations for packages (ClamAV?) or volunteers 
to run the scan would be appreciated.
#. As before, there may be some ghosts of multi-threading + caching of fonts in 
the 1.8.10 results.  That is, you might not be able to replicate the same 
extracted text that was pulled in our run, because of differences in file 
processing order or in what was cached between single-threaded and 
multi-threaded runs.
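
A minimal sketch of how the truncation caveat could be screened for when 
picking files to investigate.  The exact cutoff is an assumption (the comment 
above only says "~1 MB"), as are the tolerance, the tail length, and the 
function names.  The second check relies on the fact that a complete PDF 
normally ends with an %%EOF marker, so a file that sits right at the cutoff and 
lacks the marker is a likely truncation candidate, not a confirmed one.

```python
# Assumption: the cutoff is treated as exactly 1 MiB; the source only says "~1 MB".
TRUNCATION_LIMIT = 1024 * 1024


def looks_truncated(size_bytes, limit=TRUNCATION_LIMIT, tolerance=1024):
    """Return True if the size is at, or within `tolerance` bytes below, the limit."""
    return limit - tolerance <= size_bytes <= limit


def has_pdf_eof_marker(data, tail=1024):
    """Return True if the last `tail` bytes contain the PDF end-of-file marker."""
    return b"%%EOF" in data[-tail:]


def is_probably_truncated_pdf(data):
    """Heuristic only: size near the cutoff and no %%EOF marker near the end."""
    return looks_truncated(len(data)) and not has_pdf_eof_marker(data)
```

Files flagged this way could be skipped when triaging new exceptions, since a 
parser failure on a truncated file says little about a regression.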

Questions for ease of use of the reports:
#.  I've been meaning to include links to the JSON files that contain the 
extracted text + metadata per run.  Would these be of use?
#.  Should I append the detected file type's suffix (.pdf for y'all) to the 
Common Crawl documents?  The initial method was to take whatever I could find 
from the original URL, which leads to quite a bit of noise (ashx, etc.) and 
quite a few files with no file suffix.
#.  Anything else in the reports that would be of use?  I'm going to add a 
"lost content" report that gives the basic content summary statistics for A 
whenever, for a given file pair, there is a new exception in B.  This might 
help prioritize fixing files that initially had good content and allow you to 
quickly skip files that had junk extracted in A but an exception in B.
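
The proposed "lost content" report could look roughly like the sketch below.  
This is an illustration under stated assumptions, not the actual report code: 
the per-file result layout (a dict with "text" and "exception" keys), the 
choice of summary statistics, and all function names are hypothetical.  A "new 
exception in B" is taken to mean that A finished without an exception and B 
did not.

```python
def content_summary(text):
    """Basic content statistics for one extracted text (hypothetical choice of stats)."""
    tokens = text.split()
    return {
        "chars": len(text),
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
    }


def lost_content_report(run_a, run_b):
    """Summarize A's content for every file with a new exception in B.

    run_a, run_b: dict mapping file id -> {"text": str, "exception": str or None}
    (assumed layout).  Rows are sorted so files with the most content in A come first.
    """
    rows = []
    for file_id, a in run_a.items():
        b = run_b.get(file_id)
        if b is None:
            continue  # no file pair to compare
        if a.get("exception") is not None or b.get("exception") is None:
            continue  # not a *new* exception in B
        row = {"file": file_id, "exception_in_b": b["exception"]}
        row.update(content_summary(a.get("text", "")))
        rows.append(row)
    rows.sort(key=lambda r: r["tokens"], reverse=True)
    return rows
```

Sorting by token count in A puts files that previously yielded substantial 
text at the top, while files that had little or junk content in A fall to the 
bottom and can be skipped quickly.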

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: content_diffs-1.8-to-2.0.xlsx
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
