[fcrepo-user] Fedora GSearch and OBJ datastream (application/pdf) extraction for Solr index

Serhiy Polyakov Wed, 23 Nov 2011 00:28:50 -0800

Hello,

I am trying to get the OBJ datastream (application/pdf) processed and
indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
isolate the problem, so the object has only DC and OBJ (PDF).


Note: PDF indexing worked for me last spring in an installation with
GSearch 2.2 on Lucene. The system I set up in the summer, with
GSearch 2.2 beta on Solr 1.4, does not index PDF either.

For debugging I tried the command-line XSLT processor Xalan 2.7.0 that
comes with GSearch. I included all the classpath variables, as I
mentioned in previous messages.

It gives this error:

file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
        at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
        at org.apache.xalan.xslt.Process.main(Process.java:1126)

Only when I downloaded Xalan 2.7.1 into a separate directory and added
it to the classpath on the command line could I process the file and
get an output file with all the fields, including the OBJ full text
extracted from the PDF. I tried overwriting the Xalan jars that came
with GSearch with the new ones, but it still gives the same error. The
input file is processed only when I run Xalan 2.7.1 directly from the
separate directory.
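
The error above names net.sf.saxon.trans.XPathException, so it may be
that a Saxon jar on the classpath is being chosen instead of Xalan. As
a rough check, assuming the processor is looked up through the standard
JAXP TransformerFactory mechanism (an assumption, not verified against
the GSearch or Xalan sources), the small class below prints which
implementation is resolved. The class name WhichTransformer is made up
for illustration.

```java
import javax.xml.transform.TransformerFactory;

public class WhichTransformer {
    // Name of the TransformerFactory implementation that JAXP resolves
    // from the current classpath and system properties.
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(factoryClassName());
    }
}
```

Running it with the same classpath as the failing command should show
whether a net.sf.saxon class is taking precedence over Xalan.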

====================
Here is an excerpt from the input object's FOXML that I am processing:

<foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
<foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
SIZE="56276">
<foxml:contentLocation TYPE="INTERNAL_ID"
REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>

====================
I am using the stylesheet foxmlToSolr.xslt that came with GSearch. It
has the following lines in the header:

xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
exclude-result-prefixes="exts"
---------------------------
And the following in the body:

<xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
@CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
    <field>
        <xsl:attribute name="name">
            <xsl:value-of select="concat('dsm.', @ID)"/>
        </xsl:attribute>
        <xsl:value-of select="exts:getDatastreamText($PID,
$REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
$TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
    </field>
</xsl:for-each>
====================
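
To confirm that the for-each above actually selects the OBJ datastream,
the same predicate can be tried outside GSearch. This is only a sketch:
the class name DatastreamSelect and the two-datastream sample document
are invented for illustration, and local-name() is used instead of the
foxml prefix to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DatastreamSelect {
    // Hypothetical, heavily reduced FOXML: one inline (X) and one managed (M) datastream.
    static final String SAMPLE =
        "<foxml:digitalObject xmlns:foxml=\"info:fedora/fedora-system:def/foxml#\">"
      + "<foxml:datastream ID=\"DC\" CONTROL_GROUP=\"X\"/>"
      + "<foxml:datastream ID=\"OBJ\" CONTROL_GROUP=\"M\"/>"
      + "</foxml:digitalObject>";

    // Returns the IDs of datastreams whose control group is M, E or R,
    // i.e. the ones the stylesheet's for-each would visit.
    static List<String> selectedIds(String foxml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(foxml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "/*/*[local-name()='datastream']"
              + "[@CONTROL_GROUP='M' or @CONTROL_GROUP='E' or @CONTROL_GROUP='R']",
                doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("ID"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate sample FOXML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectedIds(SAMPLE)); // prints [OBJ]
    }
}
```

If OBJ is selected here, the selection itself is probably not the
problem, and the failure would be in the getDatastreamText call.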

When objects are submitted into Fedora, all inline datastreams get
into the index correctly. All non-inline (Managed) datastreams that do
not require external processing (like OCR text) are also indexed
correctly. The non-inline OBJ datastream, which contains a PDF and
requires external processing, does not get into the index.

I have this class
dk.defxws.fedoragsearch.server.GenericOperationsImpl

under
..tomcat/webapps/fedoragsearch/WEB-INF/classes

GSearch uses it to extract foxml.all.text, which means it is visible
to GSearch. It sounds like the problem occurs only when GSearch passes
the PDF content of the OBJ datastream for extraction and does not get
the text back.

Could somebody confirm that objects with PDF content are full-text
indexed correctly with GSearch on Solr?

Thanks,

Serhiy
