Re: [fcrepo-user] Fedora GSearch and OBJ datastream (application/pdf) extraction for Solr index

Gert Schmeltz Pedersen Mon, 28 Nov 2011 03:12:26 -0800

To avoid mixing xalan and saxon, check that the exts namespace declared in 
your indexing stylesheet (whether for lucene or solr) matches the configured 
processor. From fedoragsearch.properties:


# xsltProcessor, xalan or saxon
# this choice must be accompanied by the right namespace in your foxmlToLucene.xslt:
#   xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl" for xalan
#   xmlns:exts="java://dk.defxws.fedoragsearch.server.GenericOperationsImpl"  for saxon
fedoragsearch.xsltProcessor = xalan
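
The pairing described in those property comments can be checked mechanically. The following is a minimal sketch, assuming only the two scheme prefixes quoted above (xalan:// for xalan, java:// for saxon). The class name ExtsNamespaceCheck and the method matches are hypothetical helpers for illustration and are not part of GSearch.

```java
import java.util.Map;

public class ExtsNamespaceCheck {
    // Scheme prefixes taken from the comments in fedoragsearch.properties quoted above.
    static final Map<String, String> SCHEME = Map.of(
        "xalan", "xalan://",
        "saxon", "java://");

    /** True when the exts namespace URI starts with the scheme expected for the processor. */
    static boolean matches(String xsltProcessor, String extsNamespaceUri) {
        String expected = SCHEME.get(xsltProcessor.trim().toLowerCase());
        return expected != null && extsNamespaceUri.trim().startsWith(expected);
    }

    public static void main(String[] args) {
        String ns = "xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl";
        // Value of fedoragsearch.xsltProcessor paired with the stylesheet's xmlns:exts value.
        System.out.println("xalan with xalan:// namespace: " + matches("xalan", ns));
        System.out.println("saxon with xalan:// namespace: " + matches("saxon", ns));
    }
}
```

The two strings would be read by hand from fedoragsearch.properties and from the xmlns:exts attribute of the stylesheet; the sketch deliberately does not parse either file.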

Please address the Islandora community also. I appreciate very much their use 
of GSearch, but I cannot answer questions about it.

-Gert

On 28/11/2011, at 11.47, Serhiy Polyakov wrote:

Gert,

I fixed the problem of indexing managed datastreams by downloading
the newer GSearch 2.3 a few days ago. Before that I was trying to fix the
problem with the version I got from https://github.com/fcrepo/gsearch on
October 31. I think it was a beta.

So this class
dk.defxws.fedoragsearch.server.GenericOperationsImpl
is working now.

I am still getting an error with another class that comes from
Islandora and is supposed to parse the inline MODS datastream. Command
line processing with Xalan gives this error, which for some reason
refers to Saxon:

file:///home/user1/4_XML_Tr/demoFoxmlToSolr.xslt; Line #298; Column #-1;
XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a
matching 8-argument function named
{xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
      at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
      at org.apache.xalan.xslt.Process.main(Process.java:1126)

Interestingly, the transformation worked from the command line when I removed
tomcat/webapps/fedoragsearch/WEB-INF/lib/saxon9he.jar
and also copied fedora-client-3.1.jar there (taken from GSearch 2.2).
However, in that case http://myhost:8080/fedoragsearch/rest says Saxon
is missing.
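
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the Xalan command line tool obtains its processor through the standard JAXP lookup, so a Saxon jar on the same classpath can end up being selected even though org.apache.xalan.xslt.Process was started. The sketch below prints which TransformerFactory JAXP resolves and where it was loaded from. The class name WhichXsltProcessor is hypothetical; the JAXP calls are standard JDK API.

```java
import java.security.CodeSource;
import javax.xml.transform.TransformerFactory;

public class WhichXsltProcessor {
    /** Class name of the TransformerFactory that JAXP resolves on the current classpath. */
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Location the factory class was loaded from, or a note when it ships with the JDK. */
    static String factoryLocation() {
        Class<?> c = TransformerFactory.newInstance().getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "(JDK built-in)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        System.out.println("TransformerFactory: " + factoryClassName());
        System.out.println("Loaded from:        " + factoryLocation());
    }
}
```

Run it with the same -cp value as the failing command. If the first line names a net.sf.saxon class, the standard system property javax.xml.transform.TransformerFactory can be set on the command line (for example to org.apache.xalan.processor.TransformerFactoryImpl) to request Xalan without removing saxon9he.jar from the webapp.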


Thanks,
Serhiy



On Wed, Nov 23, 2011 at 3:23 AM, Gert Schmeltz Pedersen
<g...@dtic.dtu.dk> wrote:
I can confirm that the pdf document in datastream DS2 of the demo object 
demo:18 is indexed in my test installation.

If I understand you correctly, you _do_ get the pdf indexed as part of 
foxml.all.text, right? That must mean that the error is produced somewhere 
else in your indexing stylesheet, maybe at line #86 as indicated in the error 
message below. It is also strange that the error message refers to saxon; 
saxon cannot work when your exts namespace refers to xalan. Look into 
fedoragsearch.log and catalina.out; there must be something there.

-Gert


On 23/11/2011, at 09.26, Serhiy Polyakov wrote:

Hello,

I am trying to get the OBJ datastream (application/pdf) processed and
indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
isolate the problem, so I have only DC and OBJ (pdf).

Note: Pdf indexing was working for me in last spring's installation with
GSearch 2.2 on Lucene. The summer installation with GSearch 2.2 beta on
Solr 1.4 is not indexing pdf either.

For debugging I tried the command line XSLT processor xalan 2.7.0 that
comes with GSearch. I included all the classpath variables that I mentioned
in previous messages.

It gives this error:

file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
      at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
      at org.apache.xalan.xslt.Process.main(Process.java:1126)

Only when I downloaded Xalan 2.7.1 into a separate directory and put it
on the classpath in the command line could I process the file and get output
with all the fields, including the OBJ fulltext extracted from the pdf. I tried
to overwrite the Xalan jars that came with GSearch with the new ones, but it
still gives the same error. The input file is processed only when I run
Xalan 2.7.1 directly from the separate directory.

====================
Here is an excerpt from the FOXML of the input object I am processing:

<foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
<foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
SIZE="56276">
<foxml:contentLocation TYPE="INTERNAL_ID"
REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>

====================
I am using the stylesheet foxmlToSolr.xslt that came with GSearch. It has
the following lines in the header:

xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
exclude-result-prefixes="exts"
---------------------------
And the following in the body:

<xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
@CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
  <field>
      <xsl:attribute name="name">
          <xsl:value-of select="concat('dsm.', @ID)"/>
      </xsl:attribute>
      <xsl:value-of select="exts:getDatastreamText($PID,
$REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
$TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
  </field>
</xsl:for-each>
====================
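
The stylesheet call above passes eight arguments, and the error message complains about a missing 8-argument function. To rule out a plain signature mismatch, the parameter counts of the Java method can be listed by reflection. This is a minimal sketch under assumptions: the class name ExtensionArity and the method arities are hypothetical, the target class must be on the classpath together with its dependencies, and an XSLT processor may apply further rules beyond argument count when it binds extension functions.

```java
import java.lang.reflect.Method;
import java.util.SortedSet;
import java.util.TreeSet;

public class ExtensionArity {
    /** Parameter counts of all public methods called methodName on className. */
    static SortedSet<Integer> arities(String className, String methodName) {
        Class<?> cls;
        try {
            // Load without running static initializers; only signatures are inspected.
            cls = Class.forName(className, false, ExtensionArity.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Class not on classpath: " + className, e);
        }
        SortedSet<Integer> counts = new TreeSet<>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName)) {
                counts.add(m.getParameterCount());
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Defaults are the class and function named in the error message above.
        String cls = args.length > 0 ? args[0]
                : "dk.defxws.fedoragsearch.server.GenericOperationsImpl";
        String fn = args.length > 1 ? args[1] : "getDatastreamText";
        System.out.println(cls + "#" + fn + " parameter counts: " + arities(cls, fn));
    }
}
```

If 8 appears in the printed set, the method exists with the expected number of parameters, and the lookup failure is more likely to come from which processor handles the exts namespace than from the class itself.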

When objects are submitted into Fedora, all inline datastreams get into
the index OK. All non-inline (Managed) datastreams that do not require
external processing (like OCR text) are also indexed OK. The non-inline
OBJ datastream, which contains a pdf and requires external processing,
does not get into the index.

I have this class
dk.defxws.fedoragsearch.server.GenericOperationsImpl

under
..tomcat/webapps/fedoragsearch/WEB-INF/classes

and it is used by GSearch for extraction of foxml.all.text, which means
it is visible to GSearch. It sounds like the failure happens only when
GSearch passes the pdf content of the OBJ datastream for extraction and
does not get the text back.

Could somebody confirm that objects with pdf content are fulltext
indexed OK with GSearch on Solr?

Thanks,

Serhiy

------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Fedora-commons-users mailing list
Fedora-commons-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users


