I tried replacing
fedora/tomcat/webapps/fedoragsearch/WEB-INF/lib/PDFBox-0.7.2.jar
with pdfbox-1.1.0.jar, but GSearch did not work with it. Tomcat gave the
output below. It would still be nice to get the new PDFBox working somehow.

--Serhiy


javax.servlet.ServletException: Servlet execution threw an exception

root cause

java.lang.NoClassDefFoundError: org/pdfbox/exceptions/InvalidPasswordException
        dk.defxws.fedoragsearch.server.Config.checkMimeTypes(Config.java:724)
        dk.defxws.fedoragsearch.server.Config.checkConfig(Config.java:275)
        dk.defxws.fedoragsearch.server.Config.<init>(Config.java:232)
        dk.defxws.fedoragsearch.server.Config.getCurrentConfig(Config.java:133)
        dk.defxws.fedoragsearch.server.RESTImpl.doGet(RESTImpl.java:85)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

root cause

java.lang.ClassNotFoundException: org.pdfbox.exceptions.InvalidPasswordException
        org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1387)
        org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1233)
        java.lang.ClassLoader.loadClassInternal(ClassLoader.java:316)
        dk.defxws.fedoragsearch.server.Config.checkMimeTypes(Config.java:724)
        dk.defxws.fedoragsearch.server.Config.checkConfig(Config.java:275)
        dk.defxws.fedoragsearch.server.Config.<init>(Config.java:232)
        dk.defxws.fedoragsearch.server.Config.getCurrentConfig(Config.java:133)
        dk.defxws.fedoragsearch.server.RESTImpl.doGet(RESTImpl.java:85)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
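
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the trace shows GSearch asking for the class under the old org.pdfbox package, while the Apache releases of PDFBox are believed to ship their classes under org.apache.pdfbox. If so, swapping the jar alone cannot work until GSearch is rebuilt against the new package names. The sketch below is a small classpath probe; the class name PdfBoxPackageProbe and the org.apache.pdfbox class name are illustrative assumptions, not part of GSearch or taken from the trace.

```java
public class PdfBoxPackageProbe {

    /** Returns true if the named class can be loaded without initializing it. */
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, PdfBoxPackageProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but not linkable.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            // Name taken from the stack trace (PDFBox 0.7.x layout).
            "org.pdfbox.exceptions.InvalidPasswordException",
            // Assumed name under the Apache package layout.
            "org.apache.pdfbox.exceptions.InvalidPasswordException"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "missing"));
        }
    }
}
```

Run it with the jar under test on the classpath, for example `java -cp pdfbox-1.1.0.jar:. PdfBoxPackageProbe`, to see which of the two names that jar actually provides.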




On Thu, Apr 15, 2010 at 2:11 AM, Gert Schmeltz Pedersen <[email protected]> wrote:
> GSearch currently (version 2.2) includes PDFBox-0.7.2.jar. The
> problems with these characters presumably started with newer PDF tools
> that emit them, which is why only a smaller part of your PDF documents
> have them. The newest PDFBox is 1.1.0, available from
> http://pdfbox.apache.org/download.html
> It will be included in GSearch soon, but you may try it yourself. I
> would be interested to know whether simply substituting the new jar
> solves the problem.
>
> Best
> Gert
>
>
> On 14/04/2010, at 23.23, Ben Ranker wrote:
>
>> Quoth Serhiy Polyakov on Wed, Apr 14, 2010 at 03:00:46PM -0400:
>>> I use FedoraGSearch to index and have the same problem for about 20%
>>> of PDFs in my collection. Log has the following:
>>>
>>> org.xml.sax.SAXParseException: Character reference "&#24" is an
>>> invalid XML character
>>>
>>> The same for "&#0"
>>
>> XML 1.0 does not allow &#24; (or in fact most characters from &#0; to
>> &#31;, with a few exceptions) to appear within a document. XML 1.1
>> allows &#1; and up but still does not allow &#0;. I’m afraid I don’t
>> have a practical solution handy, but perhaps a suggestion might point
>> you in the right direction: If you’re able to convince gsearch and solr
>> to communicate using XML 1.1, it should solve your &#24; problem.
>>
>> It won’t solve the &#0; problem, though, since that’s still disallowed
>> in XML 1.1. I recommend stripping those out. In fairness, there’s
>> probably not a good reason for PDF text to contain any of these control
>> characters in the first place, so your efforts may be better focused on
>> stripping out the control characters than worrying about XML 1.1 at all.
>>
>> One of our developers has dealt with the problem of stripping &#0; from
>> PDF output before. Unfortunately she’s on vacation at the moment, so I
>> can’t just bop over to her cube and ask her. I *think* she wrote a
>> special dissemination for the purpose of getting the PDF text and
>> stripping out unwanted characters, but I’m not certain of that. I’ll
>> send her an email, in any case, and see if she can offer some advice
>> when she gets back.
>>
>> --
>> Ben Ranker <[email protected]>
>> Software Engineer, Sr.
>> Emory University Libraries
>
>
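
The stripping approach suggested in the quoted message can be sketched as follows. This is a minimal illustration, not GSearch code: the class XmlCharFilter and its methods are hypothetical names, and it assumes the extracted PDF text is available as a Java String before it is placed into the XML sent to the indexer. The permitted ranges follow the Char production of XML 1.0 (tab, line feed, carriage return, and the ranges from U+0020 upward excluding surrogates, U+FFFE, and U+FFFF).

```java
public class XmlCharFilter {

    /** True if the code point is permitted by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every code point that XML 1.0 forbids removed. */
    public static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0000 and U+0018 are the two characters reported in the log.
        String dirty = "page\u0000 one\u0018 text";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Filtering the text before it is serialized avoids the need to switch the GSearch and Solr exchange to XML 1.1, and it also handles &#0;, which XML 1.1 would still reject.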

_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
