Re: [fcrepo-user] Fedora GSearch and OBJ datastream (application/pdf) extraction for Solr index

Gert Schmeltz Pedersen Wed, 23 Nov 2011 01:24:47 -0800

I can confirm that the pdf document in datastream DS2 of the demo object 
demo:18 is indexed in my test installation.


If I understand you correctly, you _do_ get the pdf indexed as part of 
foxml.all.text, right? So the error must be produced somewhere else in your 
indexing stylesheet, maybe in line #86 as indicated in the error message 
below. Also, it is strange that the error message refers to saxon: saxon 
cannot work when your exts namespace refers to xalan. Look into 
fedoragsearch.log and catalina.out; there must be something there.
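
One way to check which processor actually runs the stylesheet is a small probe 
like the sketch below. It is not part of GSearch; it only uses the standard 
XSLT 1.0 functions system-property() and function-available(), with the exts 
namespace copied from your stylesheet. Run it with the same command line and 
classpath as foxmlToSolr.xslt, against any XML input.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Diagnostic sketch only: reports the XSLT processor in use and whether
     it can see an extension function named exts:getDatastreamText. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
    exclude-result-prefixes="exts">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Vendor string of the processor that is executing this stylesheet -->
    <xsl:text>vendor: </xsl:text>
    <xsl:value-of select="system-property('xsl:vendor')"/>
    <!-- XSLT version implemented by that processor -->
    <xsl:text>&#10;version: </xsl:text>
    <xsl:value-of select="system-property('xsl:version')"/>
    <!-- true/false: is a function with this name visible to the processor? -->
    <xsl:text>&#10;getDatastreamText available: </xsl:text>
    <xsl:value-of select="function-available('exts:getDatastreamText')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Note that function-available() tests only the function name, not the number 
of arguments, so "true" here does not prove that the 8-argument version is 
present. How a processor other than xalan treats the xalan:// namespace is 
something I would not assume.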

-Gert


On 23/11/2011, at 09.26, Serhiy Polyakov wrote:

> Hello,
> 
> I am trying to get the OBJ datastream (application/pdf) processed and
> indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
> isolate the problem, so I have only DC and OBJ (pdf).
> 
> Note: PDF indexing was working for me in last spring's installation with
> GSearch 2.2 on Lucene. The summer system with GSearch 2.2 beta on
> Solr 1.4 is not indexing pdf either.
> 
> For the debugging I tried command line XSLT processor xalan 2.7.0 that
> comes with GSearch. I include all classpath vars as I mentioned in
> previous messages.
> 
> It gives this Error:
> 
> file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
> XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
> 8-argument function named
> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
> Exception in thread "main" java.lang.RuntimeException: Cannot find a
> matching 8-argument function named
> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>        at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
>        at org.apache.xalan.xslt.Process.main(Process.java:1126)
> 
> Only when I downloaded Xalan 2.7.1 into a separate directory and added
> it to the classpath on the command line could I process the file and get
> an output file with all the fields, including the OBJ fulltext extracted
> from the pdf. I tried to overwrite the Xalan jars that came with GSearch
> with the new ones, but it still gives the same error. Only when I run
> Xalan 2.7.1 directly from the separate directory does it process the
> input file.
> 
> ====================
> Here is excerpt from the input object's Foxml I am using to process:
> 
> <foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
> STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
> <foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
> CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
> SIZE="56276">
> <foxml:contentLocation TYPE="INTERNAL_ID"
> REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>
> 
> ====================
> I am using stylesheet foxmlToSolr.xslt that came with GSearch. It has
> the following lines in header:
> 
> xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
> exclude-result-prefixes="exts"
> ---------------------------
> And the following in the body:
> 
> <xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
> @CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
>    <field>
>        <xsl:attribute name="name">
>            <xsl:value-of select="concat('dsm.', @ID)"/>
>        </xsl:attribute>
>        <xsl:value-of select="exts:getDatastreamText($PID,
> $REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
> $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
>    </field>
> </xsl:for-each>
> ====================
> 
> When objects are submitted into Fedora, all inline datastreams get
> into the index OK. All non-inline (Managed) datastreams that do
> not require external processing (like OCR text) are also indexed OK.
> The non-inline OBJ datastream containing pdf, which requires external
> processing, is not getting into the index.
> 
> I have this package
> dk.defxws.fedoragsearch.server.GenericOperationsImpl
> 
> under
> ..tomcat/webapps/fedoragsearch/WEB-INF/classes
> 
> And it is used by GSearch for extraction of foxml.all.text, which means
> it is visible to GSearch. It sounds like it is only when GSearch passes
> the pdf content of the OBJ datastream for extraction that it is not
> getting it back.
> 
> Could somebody confirm that objects with pdf content are fulltext
> indexed OK with GSearch on Solr?
> 
> Thanks,
> 
> Serhiy
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Fedora-commons-users mailing list
> Fedora-commons-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users


