Re: [fcrepo-user] Fedora GSearch and OBJ datastream (application/pdf) extraction for Solr index

Serhiy Polyakov Wed, 23 Nov 2011 10:11:36 -0800

Gert,

In demoFoxmlToSolr there is a part where it says: a managed datastream
is fetched, and if its mimetype can be handled, the text becomes the
value of the field.


All my installations of GSearch + Solr properly process managed
datastreams with MIMETYPEs “text/xml” or “text/plain”. MIMETYPEs
“application/msword”, “application/pdf”, and “application/ps” are not
processed into the index, so my resulting foxml.all.text field is
missing the full text from those three MIMETYPEs.

I suppose that demoFoxmlToSolr is correct because I was able to fetch
and process these three MIMETYPEs into text in debugging mode from the
command line, using Xalan 2.7.1 downloaded into a separate directory.
However, I was not able to repeat that from the command line with the
Xalan that sits inside FedoraGSearch. In the latter case, yes, it is
strange that the error message refers to saxon.
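
One way to narrow this down is to check which XSLT engine JAXP actually
hands out on a given classpath. The following is only a minimal sketch:
the class name WhichXslt and its methods are hypothetical helpers, not
part of GSearch or Xalan, and it assumes that the command-line processor
obtains its engine through the standard
javax.xml.transform.TransformerFactory lookup, which I have not verified
for this setup.

```java
import java.security.CodeSource;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic helper (not part of GSearch or Xalan).
class WhichXslt {

    // Class of the TransformerFactory that JAXP selects on this classpath.
    static Class<?> factoryClass() {
        return TransformerFactory.newInstance().getClass();
    }

    // Fully qualified name of the selected factory class.
    static String factoryName() {
        return factoryClass().getName();
    }

    // Jar or directory the factory class was loaded from, if reported.
    static String factoryLocation() {
        CodeSource src = factoryClass().getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(no code source reported, probably JDK built-in)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) {
        // An explicit system property, if set, overrides classpath discovery.
        System.out.println("system property: "
                + System.getProperty("javax.xml.transform.TransformerFactory"));
        System.out.println("factory class:   " + factoryName());
        System.out.println("loaded from:     " + factoryLocation());
    }
}
```

Run it with the same -cp value as the failing command. A class name
beginning with net.sf.saxon would suggest that a Saxon jar on that
classpath is being picked up instead of Xalan.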

I will look into the logs, do more testing, and install a new instance
of GSearch with Solr.

Thanks,
Serhiy




On Wed, Nov 23, 2011 at 3:23 AM, Gert Schmeltz Pedersen
<g...@dtic.dtu.dk> wrote:
> I can confirm that the pdf document in datastream DS2 of the demo object 
> demo:18 is indexed in my test installation.
>
> If I understand you correctly, you _do_ get the pdf indexed as part of
> foxml.all.text, right? That must mean the error is produced somewhere
> else in your indexing stylesheet, maybe at line #86 as indicated in the
> error message below. It is also strange that the error message refers
> to saxon: saxon cannot work when your exts namespace refers to xalan.
> Look into fedoragsearch.log and catalina.out; there must be something
> there.
>
> -Gert
>
>
> On 23/11/2011, at 09.26, Serhiy Polyakov wrote:
>
>> Hello,
>>
>> I am trying to get the OBJ datastream (application/pdf) processed and
>> indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
>> isolate the problem, so I have only DC and OBJ (pdf).
>>
>> Note: PDF indexing was working for me in last spring's installation
>> with GSearch 2.2 on Lucene. The summer system with GSearch 2.2 beta on
>> Solr 1.4 is not indexing PDF either.
>>
>> For debugging I tried the command-line XSLT processor Xalan 2.7.0 that
>> comes with GSearch. I included all the classpath variables as I
>> mentioned in previous messages.
>>
>> It gives this error:
>>
>> file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
>> XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
>> 8-argument function named
>> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>> Exception in thread "main" java.lang.RuntimeException: Cannot find a
>> matching 8-argument function named
>> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>>        at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
>>        at org.apache.xalan.xslt.Process.main(Process.java:1126)
>>
>> Only when I downloaded Xalan 2.7.1 into a separate directory and added
>> it to the classpath on the command line could I process the input and
>> get an output file with all the fields, including the OBJ full text
>> extracted from the pdf. I tried to overwrite the Xalan jars that came
>> with GSearch with the new ones, but it still gives the same error. The
>> input file is processed only when I run Xalan 2.7.1 directly from the
>> separate directory.
>>
>> ====================
>> Here is an excerpt from the FOXML of the input object I am processing:
>>
>> <foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
>> STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
>> <foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
>> CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
>> SIZE="56276">
>> <foxml:contentLocation TYPE="INTERNAL_ID"
>> REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>
>>
>> ====================
>> I am using the stylesheet foxmlToSolr.xslt that came with GSearch. It
>> has the following lines in the header:
>>
>> xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
>> exclude-result-prefixes="exts"
>> ---------------------------
>> And the following in the body:
>>
>> <xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
>> @CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
>>    <field>
>>        <xsl:attribute name="name">
>>            <xsl:value-of select="concat('dsm.', @ID)"/>
>>        </xsl:attribute>
>>        <xsl:value-of select="exts:getDatastreamText($PID,
>> $REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
>> $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
>>    </field>
>> </xsl:for-each>
>> ====================
>>
>> When objects are submitted into Fedora, all inline datastreams get
>> into the index OK. All non-inline (managed) datastreams that do not
>> require external processing (like OCR text) are also processed into
>> the index OK. The non-inline datastream OBJ, which contains a pdf that
>> requires external processing, does not get into the index.
>>
>> I have this package
>> dk.defxws.fedoragsearch.server.GenericOperationsImpl
>>
>> under
>> ..tomcat/webapps/fedoragsearch/WEB-INF/classes
>>
>> And it is used by GSearch for extraction of foxml.all.text, which means
>> it is visible to GSearch. It sounds like it is only when GSearch passes
>> the pdf content of the OBJ datastream for extraction that it does not
>> get it back.
>>
>> Could somebody confirm that objects with pdf content are fulltext
>> indexed OK with GSearch on Solr?
>>
>> Thanks,
>>
>> Serhiy
>>
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure
>> contains a definitive record of customers, application performance,
>> security threats, fraudulent activity, and more. Splunk takes this
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Fedora-commons-users mailing list
>> Fedora-commons-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>
>

