[jira] [Commented] (PDFBOX-3058) Support TIKA Migration to PDFBox 2.0

Tim Allison (JIRA) Mon, 26 Oct 2015 04:04:09 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974025#comment-14974025
 ]


Tim Allison commented on PDFBOX-3058:
-------------------------------------

Thank you, [~msahyoun] for opening this.

[~tilman], wow.  Thank you.

From Tilman on the [dev list|http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201510.mbox/%3C562BF2F5.2050203%40t-online.de%3E]
{quote}1)
re file commoncrawl2/NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU:

Please test your code why the word "microsoft" is missing. This is in the 
/Title:
18 0 obj
(Microsoft Word - Water Line Pipe Sizing.docx.doc) endobj
{quote}

The content comparison code currently looks only at the content of the 
docs, not the metadata.  The only metadata comparison currently available is 
a comparison of the counts of metadata values.  If we want to add a comparison 
of metadata content (and I really would)... should we concatenate all metadata 
values into one string and then run the current comparison methods, and dump 
the results to a metadata_content_diffs.csv table?  Or is there a better way 
to compare metadata values?
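
A rough sketch of what the "concatenate and compare" idea might look like, for discussion only. Everything here is hypothetical: the class and method names are invented, a plain Map<String, List<String>> stands in for whatever metadata object the comparison code really uses, and the token-count diff is a simple stand-in for the existing content comparison methods, which may work differently.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names come from Tika or PDFBox. */
final class MetadataDiffSketch {

    /** Joins all metadata values into one string, ordered by key so the result is deterministic. */
    static String concatenate(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        for (List<String> values : new TreeMap<>(metadata).values()) {
            for (String value : values) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(value);
            }
        }
        return sb.toString();
    }

    /** Counts lowercased tokens, splitting on anything that is not a letter or digit. */
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Tokens that occur more often in a than in b, mapped to the surplus count. */
    static Map<String, Integer> surplus(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int diff = e.getValue() - b.getOrDefault(e.getKey(), 0);
            if (diff > 0) {
                out.put(e.getKey(), diff);
            }
        }
        return out;
    }

    /** Renders a token map as "token: count | token: count". */
    static String format(Map<String, Integer> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : tokens.entrySet()) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(e.getKey()).append(": ").append(e.getValue());
        }
        return sb.toString();
    }

    /** Wraps a CSV field in double quotes and doubles any embedded quotes. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** One CSV row: file id, tokens missing from the new output, tokens added in the new output. */
    static String csvRow(String fileId,
                         Map<String, List<String>> oldMeta,
                         Map<String, List<String>> newMeta) {
        Map<String, Integer> oldCounts = tokenCounts(concatenate(oldMeta));
        Map<String, Integer> newCounts = tokenCounts(concatenate(newMeta));
        return quote(fileId) + ","
                + quote(format(surplus(oldCounts, newCounts))) + ","
                + quote(format(surplus(newCounts, oldCounts)));
    }
}
```

Each row from csvRow could be appended to a metadata_content_diffs.csv file, one row per document. Concatenating loses track of which metadata key a token came from; if that matters, the same diff could be run per key instead.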

{quote}
2)
Could you please rerun the test with the latest trunk, preferably with 
the same test set? One of the bugs I fixed (PDFBOX-3053) applies to many 
files. So now I have the problem that "problem" files I test manually no 
longer miss the tokens mentioned in the report.
{quote}

Happy to rerun... ready to go?  Or should I wait a bit?

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: content_diffs-1.8-to-2.0.xlsx
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
