Gert,

I fixed the problem of indexing managed datastreams by downloading the newer GSearch 2.3 a few days ago. Before that, I had been trying to fix the problem with the version I got from https://github.com/fcrepo/gsearch on October 31; I think it was a beta.
So this class, dk.defxws.fedoragsearch.server.GenericOperationsImpl, is working now. I am still getting the error with another class that comes from Islandora and is supposed to parse the MODS inline datastream. Command-line processing with Xalan gives this error, which for some reason refers to Saxon:

file:///home/user1/4_XML_Tr/demoFoxmlToSolr.xslt; Line #298; Column #-1; XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching 8-argument function named {xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
Exception in thread "main" java.lang.RuntimeException: Cannot find a matching 8-argument function named {xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
    at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
    at org.apache.xalan.xslt.Process.main(Process.java:1126)

Interestingly, the parser worked from the command line when I removed tomcat/webapps/fedoragsearch/WEB-INF/lib/saxon9he.jar and also copied fedora-client-3.1.jar there (taken from GSearch 2.2). However, in this case http://myhost:8080/fedoragsearch/rest says Saxon is missing.

Thanks,
Serhiy

On Wed, Nov 23, 2011 at 3:23 AM, Gert Schmeltz Pedersen <g...@dtic.dtu.dk> wrote:
> I can confirm that the pdf document in datastream DS2 of the demo object
> demo:18 is indexed in my test installation.
>
> If I understand you correctly, you _do_ get the pdf indexed as part of
> foxml.all.text, right? So that must mean that the error is produced somewhere
> else in your indexing stylesheet, maybe in line #86 as indicated in the error
> message below, also, it is strange that the error message refers to saxon,
> saxon cannot work, when your exts refers to xalan. Look into
> fedoragsearch.log and catalina.out, there must be something.
>
> -Gert
>
>
> On 23/11/2011, at 09.26, Serhiy Polyakov wrote:
>
>> Hello,
>>
>> I am trying to get OBJ datastream (application/pdf) processed and
>> indexed into Solr 3.4 with GSearch2.3. I excluded all MODS streams to
>> isolate the problem.
>> So I have DC and OBJ (pdf)
>>
>> Note: Pdf indexing was working for me in last spring installation with
>> GSearch 2.2 on Lucene. Summer time system with GSearch 2.2 beta on
>> Solr 1.4 is not indexing pdf either.
>>
>> For the debugging I tried command line XSLT processor xalan 2.7.0 that
>> comes with GSearch. I include all classpath vars as I mentioned in
>> previous messages.
>>
>> It gives this Error:
>>
>> file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
>> XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
>> 8-argument function named
>> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>> Exception in thread "main" java.lang.RuntimeException: Cannot find a
>> matching 8-argument function named
>> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>> at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
>> at org.apache.xalan.xslt.Process.main(Process.java:1126)
>>
>> Only when I downloaded Xalan 2.7.1 into separate directory and added
>> classpath to it in the command line I can process and get output file
>> with all the fields including OBJ fulltext extracted from pdf. I tried
>> to overwrite Xalan Jars that came with GSearch with new ones but it
>> still gives same error. Only when I am directly running Xalan 2.7.1
>> from the separate directory it is processing the input file.
>>
>> ====================
>> Here is an excerpt from the input object's Foxml I am using to process:
>>
>> <foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
>> STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
>> <foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
>> CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
>> SIZE="56276">
>> <foxml:contentLocation TYPE="INTERNAL_ID"
>> REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>
>>
>> ====================
>> I am using stylesheet foxmlToSolr.xslt that came with GSearch. It has
>> the following lines in header:
>>
>> xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
>> exclude-result-prefixes="exts"
>> ---------------------------
>> And the following in the body:
>>
>> <xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
>> @CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
>>   <field>
>>     <xsl:attribute name="name">
>>       <xsl:value-of select="concat('dsm.', @ID)"/>
>>     </xsl:attribute>
>>     <xsl:value-of select="exts:getDatastreamText($PID,
>> $REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
>> $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
>>   </field>
>> </xsl:for-each>
>> ====================
>>
>> When objects are submitted into Fedora all inline data streams are
>> getting OK into the index. All non-inline (Managed) datastreams that do
>> not require external processing (like OCR text) are processed OK into
>> index. The non-inline datastream OBJ, containing a pdf that requires
>> external processing, is not getting into the index.
>>
>> I have this package
>> dk.defxws.fedoragsearch.server.GenericOperationsImpl
>>
>> under
>> ..tomcat/webapps/fedoragsearch/WEB-INF/classes
>>
>> And it is used by GSearch for extraction of foxml.all.text. It means
>> it is visible for GSearch. Sounds like it is only when GSearch passes
>> pdf content of OBJ datastream for extraction it is not getting it
>> back.
>>
>> Could somebody confirm that objects with pdf content are fulltext
>> indexed OK with GSearch on Solr?
>>
>> Thanks,
>>
>> Serhiy
>>
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure
>> contains a definitive record of customers, application performance,
>> security threats, fraudulent activity, and more. Splunk takes this
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Fedora-commons-users mailing list
>> Fedora-commons-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
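
A note on the Saxon reference in the Xalan errors quoted in this thread: as far as I know, org.apache.xalan.xslt.Process obtains its processor through the standard JAXP lookup (TransformerFactory.newInstance()), and a saxon9he.jar on the classpath registers Saxon for that lookup. If so, the stylesheet may be compiled by Saxon even though Xalan's command-line class was started, which would be consistent with the error disappearing once saxon9he.jar is removed. Below is a minimal Java sketch for checking which processor is actually selected. The class name WhichXsltProcessor and its helper methods are made up for illustration; it should be compiled and run with the same classpath as the failing command.

```java
import java.security.CodeSource;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic class (not part of GSearch, Xalan, or Saxon).
public class WhichXsltProcessor {

    /** Class name of the TransformerFactory that JAXP resolves on this classpath. */
    public static String resolvedFactory() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Jar or directory a class was loaded from, or a marker when none is reported. */
    public static String loadedFrom(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(no code source reported, probably JDK built-in)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) {
        TransformerFactory tf = TransformerFactory.newInstance();
        System.out.println("TransformerFactory: " + tf.getClass().getName());
        System.out.println("Loaded from:        " + loadedFrom(tf.getClass()));
        // null here means no explicit override was given on the command line
        System.out.println("System property:    "
                + System.getProperty("javax.xml.transform.TransformerFactory"));
    }
}
```

If this prints net.sf.saxon.TransformerFactoryImpl, one option that may be worth trying is to leave the Saxon jar in place for the web application and select Xalan explicitly for the command-line run with -Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl. Whether that suits a particular GSearch installation is an assumption to verify against fedoragsearch.log and catalina.out.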