There are a number of different versions of PDF and a number of 
applications that generate PDFs. Some combinations of version and 
application generate PDFs that are subtly misunderstood by some 
applications that read PDFs.

I suggest that you try to narrow down which application was used to 
generate the PDFs you're having difficulty with.
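
As a rough illustration of that narrowing-down step, the sketch below tallies PDFs in a folder by the /Producer string in their document information dictionary. It is only a heuristic and rests on assumptions: the Info dictionary is stored uncompressed and unencrypted, and the value is a literal string in parentheses (hex-encoded strings, nested parentheses and XMP-only metadata are not handled). The function names are made up for this example.

```python
import re
from collections import Counter
from pathlib import Path

# Matches a literal-string entry such as: /Producer (Some Tool 1.0)
# Assumption: the Info dictionary is uncompressed and unencrypted.
_FIELD = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\)])*)\)")

def generator_fields(data: bytes) -> dict:
    """Return the first /Producer and /Creator strings found in raw PDF bytes."""
    found = {}
    for key, value in _FIELD.findall(data):
        # Keep only the first occurrence of each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found

def tally_producers(folder: str) -> Counter:
    """Count the PDFs in a folder by their /Producer string."""
    counts = Counter()
    for path in Path(folder).glob("*.pdf"):
        fields = generator_fields(path.read_bytes())
        counts[fields.get("Producer", "(unknown)")] += 1
    return counts
```

Running tally_producers separately over a folder of PDFs that fail and a folder of PDFs that filter cleanly should show whether one generator dominates the failures.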

If you can isolate a set of versions and applications that give you
trouble, you can then open and re-save the PDFs in a tool that doesn't
have the problem. This can potentially be automated too, if you have
many PDFs.
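
One possible way to automate the re-saving is sketched below, assuming Ghostscript is installed and available as "gs" on the PATH. Ghostscript is just one candidate tool, not something this thread has tested: whether it avoids the problem for your PDFs is something to check on a few samples first, and rewriting a PDF this way can alter or drop some features of the original. The function names are made up for this example.

```python
import subprocess
from pathlib import Path

def resave_command(src: Path, dst: Path) -> list:
    """Build a Ghostscript command that rewrites src as a new PDF at dst."""
    # -o names the output file and implies batch mode;
    # pdfwrite is Ghostscript's PDF output device.
    return ["gs", "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]

def resave_all(src_dir: str, dst_dir: str) -> list:
    """Re-save every PDF in src_dir into dst_dir; return the names that failed."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(src_dir).glob("*.pdf")):
        result = subprocess.run(resave_command(src, out / src.name),
                                capture_output=True)
        if result.returncode != 0:
            failed.append(src.name)
    return failed
```

Writing to a separate output folder keeps the originals untouched, so the re-saved copies can be compared against them before anything is replaced.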

We have found, for example, that PDFCreator (the Windows-based PDF
program that works like a printer driver) strips out the full text when
used to concatenate documents. Once we discovered this, it was a
relatively simple matter to adjust our workflow to compensate for the
problem and catch the few bad PDFs that had already made it through into
the collection.
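
A check along the following lines could be used to catch PDFs whose full text has gone missing. This is not the workflow described above, just a minimal sketch that assumes the pdftotext command-line tool (from Poppler or Xpdf) is installed. The 20-character threshold is an arbitrary assumption, and scanned, image-only PDFs will be flagged as well.

```python
import subprocess

def looks_textless(extracted: str, min_chars: int = 20) -> bool:
    """True if the text has fewer than min_chars non-whitespace characters."""
    return sum(1 for ch in extracted if not ch.isspace()) < min_chars

def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return what it writes to stdout ("-" means stdout)."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

def find_textless(pdf_paths) -> list:
    """Return the paths whose extraction failed or produced almost no text."""
    flagged = []
    for path in pdf_paths:
        try:
            text = extract_text(path)
        except subprocess.CalledProcessError:
            # pdftotext itself failed on this file; flag it for a manual look.
            flagged.append(path)
            continue
        if looks_textless(text):
            flagged.append(path)
    return flagged
```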

cheers
stuart



Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> I found out something very interesting this weekend.  I took a .pdf file
> that was "unfilterable"; in other words, filter-media displayed an error
> like this:
>
>  "ERROR filtering, skipping bitstream #21220 java.io.IOException: Error:
> value is not an integer type actual='--20'"
>
> On a hunch, I looked at the document and found it had several pages of
> graphics/images in it.  I deleted all the pages in the document that
> contained images, and guess what?  It filtered just fine.
>
> Hmmm…we have to be able to upload documents that contain images.  NASA
> has a LOT of images in their documents.  Now what??
>
> Sue Walker-Thornton
> NASA Langley Research Center
> (757) 224-4074
>
> 
> -----Original Message-----
> From: Graham Triggs [mailto:[EMAIL PROTECTED]
> Sent: Friday, October 24, 2008 3:13 PM
> To: [email protected]
> Subject: Re: [Dspace-tech] filter-media problem - question on size limit
>
> If anyone has example PDFs that cause the text extraction to fail
> (smaller PDFs preferably!) that they are able to share, please send them
> - or a link to retrieve them - to me.
>
> Thanks,
> G
>
> Mark H. Wood wrote:
>>  I found this:
>>
>>    http://java-source.net/open-source/pdf-libraries
>>
>>  PJX and PDF Jester look, at first glance, as though they might be
>>  worth considering.
>>
>>  OTOH it looks like PDFBox might be getting more attention in its new
>>  home, and if so, then it makes sense to stick with it and help to
>>  improve it.
>>
>>  ------------------------------------------------------------------------
>>  This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
>>  Build the coolest Linux based applications with Moblin SDK & win great prizes
>>  Grand prize is a trip for two to an Open Source event anywhere in the world
>>  http://moblin-contest.org/redirect.php?banner_id=100&url=/
>>  ------------------------------------------------------------------------
>>  _______________________________________________
>>  DSpace-tech mailing list
>>  [email protected]
>>  https://lists.sourceforge.net/lists/listinfo/dspace-tech


-- 
Stuart Yeates
Te Pātaka Kōrero o Te Whare Wānanga o te Ūpoko o te Ika a Māui
http://www.nzetc.org/       New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/     Institutional Repository


