One way to avoid mixing xalan and saxon is to check the exts namespace
declaration in your indexing stylesheet (whether for lucene or solr)
against this setting in fedoragsearch.properties:
# xsltProcessor, xalan or saxon
# this choice must be accompanied by the right namespace in your foxmlToLucene.xslt:
# xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl" for xalan
# xmlns:exts="java://dk.defxws.fedoragsearch.server.GenericOperationsImpl" for saxon
fedoragsearch.xsltProcessor = xalan
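The pairing described in those comments can be checked mechanically. The following is only a minimal sketch: the class and method names are hypothetical, and the two URI schemes are taken from the comments quoted above, not from any GSearch API.

```java
public class ExtsNamespaceCheck {

    /**
     * Returns true when the exts namespace URI starts with the scheme that
     * the fedoragsearch.properties comments pair with the given processor:
     * "xalan://" for xalan, "java://" for saxon. Unknown processor names
     * are treated as inconsistent.
     */
    public static boolean isConsistent(String xsltProcessor, String extsNamespace) {
        String processor = xsltProcessor.trim().toLowerCase();
        String namespace = extsNamespace.trim();
        if (processor.equals("xalan")) {
            return namespace.startsWith("xalan://");
        }
        if (processor.equals("saxon")) {
            return namespace.startsWith("java://");
        }
        return false;
    }

    public static void main(String[] args) {
        String ns = "xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl";
        // The value of fedoragsearch.xsltProcessor and the xmlns:exts value
        // from the stylesheet would be passed in here.
        System.out.println("xalan + xalan:// consistent? " + isConsistent("xalan", ns));
        System.out.println("saxon + xalan:// consistent? " + isConsistent("saxon", ns));
    }
}
```

In practice the two arguments would be read from fedoragsearch.properties and from the xsl:stylesheet element of the indexing stylesheet; that part is left out here.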
Please also put this question to the Islandora community. I very much
appreciate their use of GSearch, but I cannot answer questions about it.
-Gert
On 28/11/2011, at 11.47, Serhiy Polyakov wrote:
Gert,
I fixed the problem with indexing managed datastreams by downloading
the newer GSearch 2.3 a few days ago. Before that, I had been trying to
fix it with the version I got from https://github.com/fcrepo/gsearch on
October 31, which I think was a beta.
So this class
dk.defxws.fedoragsearch.server.GenericOperationsImpl
is working now.
I am still getting an error with another class, which comes from
Islandora and is supposed to parse the inline MODS datastream.
Command-line processing with Xalan gives this error, which for some
reason refers to Saxon:
file:///home/user1/4_XML_Tr/demoFoxmlToSolr.xslt; Line #298; Column #-1;
XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a
matching 8-argument function named
{xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
at org.apache.xalan.xslt.Process.main(Process.java:1126)
Interestingly, the processing worked from the command line once I removed
tomcat/webapps/fedoragsearch/WEB-INF/lib/saxon9he.jar
and copied fedora-client-3.1.jar (taken from GSearch 2.2) into that
directory. However, with the jar removed,
http://myhost:8080/fedoragsearch/rest reports that Saxon is missing.
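A possible explanation, offered as an assumption and not confirmed anywhere in this thread: the Xalan command-line tool obtains its processor through the standard JAXP call TransformerFactory.newInstance(), and a Saxon jar on the same classpath can register itself as the JAXP implementation. The "Xalan" command then ends up running Saxon, which does not recognise xalan:// extension namespaces. The sketch below (the class name is hypothetical) prints which implementation JAXP resolves; run it with exactly the classpath used for the failing command.

```java
import javax.xml.transform.TransformerFactory;

public class WhichXsltProcessor {

    /** Class name of the TransformerFactory that JAXP resolves on the current classpath. */
    public static String resolvedFactory() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // A name starting with net.sf.saxon means Saxon was picked up;
        // org.apache.xalan means Xalan; anything else is another
        // implementation, such as the JDK's built-in one.
        System.out.println("TransformerFactory in use: " + resolvedFactory());
    }
}
```

If this prints a net.sf.saxon class, one standard JAXP option is to set the system property javax.xml.transform.TransformerFactory to org.apache.xalan.processor.TransformerFactoryImpl for the command-line run only, so saxon9he.jar can stay in place for the web application. Whether that resolves this particular case is untested.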
Thanks,
Serhiy
On Wed, Nov 23, 2011 at 3:23 AM, Gert Schmeltz Pedersen
<g...@dtic.dtu.dk> wrote:
I can confirm that the pdf document in datastream DS2 of the demo object
demo:18 is indexed in my test installation.
If I understand you correctly, you _do_ get the pdf indexed as part of
foxml.all.text, right? That must mean that the error is produced somewhere
else in your indexing stylesheet, maybe in line #86, as indicated in the
error message below. It is also strange that the error message refers to
saxon: saxon cannot work when your exts namespace refers to xalan. Look into
fedoragsearch.log and catalina.out; there must be something there.
-Gert
On 23/11/2011, at 09.26, Serhiy Polyakov wrote:
Hello,
I am trying to get the OBJ datastream (application/pdf) processed and
indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
isolate the problem, so I have only DC and OBJ (pdf).
Note: PDF indexing worked for me in last spring's installation with
GSearch 2.2 on Lucene. The summer system, with GSearch 2.2 beta on
Solr 1.4, does not index PDFs either.
For debugging I tried the command-line XSLT processor Xalan 2.7.0 that
comes with GSearch. I included all the classpath variables as I
mentioned in previous messages.
It gives this error:
file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
at org.apache.xalan.xslt.Process.main(Process.java:1126)
Only when I downloaded Xalan 2.7.1 into a separate directory and added
it to the classpath on the command line could I process the file and get
an output file with all the fields, including the OBJ full text extracted
from the pdf. I tried overwriting the Xalan jars that came with GSearch
with the new ones, but that still gives the same error. The input file
is processed only when I run Xalan 2.7.1 directly from the separate
directory.
====================
Here is an excerpt from the FOXML of the input object I am processing:
<foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
<foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
SIZE="56276">
<foxml:contentLocation TYPE="INTERNAL_ID"
REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>
====================
I am using the foxmlToSolr.xslt stylesheet that came with GSearch. It has
the following lines in its header:
xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
exclude-result-prefixes="exts"
---------------------------
And the following in the body:
<xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
@CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
<field>
<xsl:attribute name="name">
<xsl:value-of select="concat('dsm.', @ID)"/>
</xsl:attribute>
<xsl:value-of select="exts:getDatastreamText($PID,
$REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
$TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
</field>
</xsl:for-each>
====================
When objects are submitted to Fedora, all inline datastreams get into
the index OK. All non-inline (managed) datastreams that do not require
external processing (like OCR text) are also indexed OK. The non-inline
OBJ datastream, which contains a pdf and requires external processing,
does not get into the index.
I have this class
dk.defxws.fedoragsearch.server.GenericOperationsImpl
under
..tomcat/webapps/fedoragsearch/WEB-INF/classes
It is used by GSearch for the extraction of foxml.all.text, which means
it is visible to GSearch. It sounds as if only the pdf content of the
OBJ datastream, when GSearch passes it on for extraction, does not come
back.
Could somebody confirm that objects with pdf content are fulltext
indexed OK with GSearch on Solr?
Thanks,
Serhiy
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Fedora-commons-users mailing list
Fedora-commons-users@lists.sourceforge.net<mailto:Fedora-commons-users@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users