Gert,

In demoFoxmlToSolr there is a part where it says: managed datastream is fetched, if its mimetype can be handled, the text becomes the value of the field.

All my installations of GSearch + Solr properly process managed datastreams with MIMETYPEs “text/xml” or “text/plain”. MIMETYPEs “application/msword”, “application/pdf”, and “application/ps” are not processed into the index. My resulting foxml.all.text field is missing the full text from those three MIMETYPEs.

I suppose that demoFoxmlToSolr is correct because I was able to fetch and process these three MIMETYPEs into text in debugging mode from the command line with Xalan 2.7.1 downloaded into a separate directory. However, I was not able to repeat that from the command line with the Xalan that sits in FedoraGSearch. In the latter case, yes, it is strange that the error message refers to saxon.

I will be looking into the logs and testing. I will also install a new instance of GSearch + Solr.

Thanks,
Serhiy

On Wed, Nov 23, 2011 at 3:23 AM, Gert Schmeltz Pedersen <g...@dtic.dtu.dk> wrote:
> I can confirm that the pdf document in datastream DS2 of the demo object
> demo:18 is indexed in my test installation.
>
> If I understand you correctly, you _do_ get the pdf indexed as part of
> foxml.all.text, right? So that must mean that the error is produced somewhere
> else in your indexing stylesheet, maybe in line #86 as indicated in the error
> message below. Also, it is strange that the error message refers to saxon;
> saxon cannot work when your exts refers to xalan. Look into
> fedoragsearch.log and catalina.out, there must be something.
>
> -Gert
>
>
> On 23/11/2011, at 09.26, Serhiy Polyakov wrote:
>
>> Hello,
>>
>> I am trying to get OBJ datastream (application/pdf) processed and
>> indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
>> isolate the problem. So I have DC and OBJ (pdf).
>>
>> Note: Pdf indexing was working for me in last spring's installation with
>> GSearch 2.2 on Lucene. The summer system with GSearch 2.2 beta on
>> Solr 1.4 is not indexing pdf either.
>>
>> For the debugging I tried command line XSLT processor xalan 2.7.0 that
>> comes with GSearch.
>> I include all classpath vars as I mentioned in
>> previous messages.
>>
>> It gives this Error:
>>
>> file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
>> XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
>> 8-argument function named
>> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>> Exception in thread "main" java.lang.RuntimeException: Cannot find a
>> matching 8-argument function named
>> {xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
>> at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
>> at org.apache.xalan.xslt.Process.main(Process.java:1126)
>>
>> Only when I downloaded Xalan 2.7.1 into separate directory and added
>> classpath to it in the command line I can process and get output file
>> with all the fields including OBJ fulltext extracted from pdf. I tried
>> to overwrite Xalan Jars that came with GSearch with new ones but it
>> still gives same error. Only when I am directly running Xalan 2.7.1
>> from the separate directory it is processing the input file.
>>
>> ====================
>> Here is excerpt from the input object's Foxml I am using to process:
>>
>> <foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
>> STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
>> <foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
>> CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
>> SIZE="56276">
>> <foxml:contentLocation TYPE="INTERNAL_ID"
>> REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>
>>
>> ====================
>> I am using stylesheet foxmlToSolr.xslt that came with GSearch.
>> It has the following lines in header:
>>
>> xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
>> exclude-result-prefixes="exts"
>> ---------------------------
>> And the following in the body:
>>
>> <xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
>>     @CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
>>   <field>
>>     <xsl:attribute name="name">
>>       <xsl:value-of select="concat('dsm.', @ID)"/>
>>     </xsl:attribute>
>>     <xsl:value-of select="exts:getDatastreamText($PID,
>>         $REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
>>         $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
>>   </field>
>> </xsl:for-each>
>> ====================
>>
>> When objects are submitted into Fedora, all inline datastreams are
>> getting into the index OK. All non-inline (Managed) datastreams that do
>> not require external processing (like OCR text) are processed into the
>> index OK. The non-inline datastream OBJ, which contains pdf and requires
>> external processing, is not getting into the index.
>>
>> I have this package
>> dk.defxws.fedoragsearch.server.GenericOperationsImpl
>>
>> under
>> ..tomcat/webapps/fedoragsearch/WEB-INF/classes
>>
>> And it is used by GSearch for extraction of foxml.all.text. It means
>> it is visible to GSearch. It sounds like it is only when GSearch passes
>> the pdf content of the OBJ datastream for extraction that it is not
>> getting it back.
>>
>> Could somebody confirm that objects with pdf content are fulltext
>> indexed OK with GSearch on Solr?
>>
>> Thanks,
>>
>> Serhiy
>>
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure
>> contains a definitive record of customers, application performance,
>> security threats, fraudulent activity, and more. Splunk takes this
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Fedora-commons-users mailing list
>> Fedora-commons-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
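
A note on the saxon reference in the error quoted above. One possible explanation, not confirmed anywhere in this thread, is that a Saxon jar on the classpath is being picked up by the standard JAXP lookup (TransformerFactory.newInstance()), so the stylesheet runs under Saxon even though it was started through org.apache.xalan.xslt.Process. Saxon would not resolve the xalan:// extension namespace, which would match the "Cannot find a matching 8-argument function" message. The small check below is only a sketch for testing that guess: the class name WhichXslt and its two helper methods are made up for illustration and are not part of GSearch or Xalan.

```java
import java.security.CodeSource;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic helper (not part of GSearch or Xalan).
class WhichXslt {

    // Class name of the XSLT engine that the JAXP lookup selects
    // for the classpath this program is started with.
    static String defaultFactoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    // Jar or directory a class was loaded from; classes that ship
    // with the JDK itself may have no code source at all.
    static String loadedFrom(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "(JDK built-in)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        TransformerFactory factory = TransformerFactory.newInstance();
        System.out.println("factory:     " + factory.getClass().getName());
        System.out.println("loaded from: " + loadedFrom(factory.getClass()));
        // Standard JAXP system property that overrides the lookup, if set.
        System.out.println("override:    "
                + System.getProperty("javax.xml.transform.TransformerFactory"));
    }
}
```

Run it with exactly the same classpath as the failing command line. If the factory printed starts with net.sf.saxon, it may be worth retrying the transform with -Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl, or with the Saxon jar removed from the classpath, to see whether the extension function is then found.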