Re: [fcrepo-user] Fedora GSearch and command line Xalan processing with foxmlToSolr.xslt

Gert Schmeltz Pedersen Tue, 22 Nov 2011 06:09:45 -0800

No, I did not mean that you should take the line with <dynamicField name="*"… 
out. I just wanted to know whether you had.


That is a line that I added to schema.xml, and it is because of that line that 
the dc fields, and potentially all other fields in a doc, get included in the 
index. It should also take care of the mods fields. If you have explicitly 
named mods fields in schema.xml, try removing those lines.
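
To see which mods fields are explicitly declared, a small shell sketch like the 
one below may help. The function name and the schema.xml location are 
assumptions for illustration, not part of GSearch. It only reports <field> 
declarations, so copyField and dynamicField lines are not listed.

```shell
# Sketch: list explicit <field> declarations whose name starts with "mods."
# The argument is the path to your own Solr schema.xml (location assumed).
list_explicit_mods_fields() {
    schema="$1"
    # -n prints line numbers so the declarations are easy to locate.
    # '<field ' does not match <copyField ...> or <dynamicField ...>.
    grep -n '<field [^>]*name="mods\.' "$schema"
}
```

Calling list_explicit_mods_fields with the path to schema.xml prints one line 
per matching declaration, prefixed with its line number. It prints nothing if 
no such field is declared.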

I do not know about islandora-exts; it is not part of GSearch, so I cannot 
answer your point (1). Maybe the Islandora people can. But as long as the mods 
fields with values are in the doc generated by your indexing stylesheet, 
GSearch has done its part, and it is schema.xml and the Solr server that 
determine what gets into the index.
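
A quick way to confirm which fields actually carry values in the generated doc 
is sketched below. The function name is hypothetical, and the sketch assumes 
fields are written as <field name="...">value</field> with no other attributes, 
as in the excerpts in this thread. Empty elements such as 
<field name="dsm.OCR"/> are deliberately skipped.

```shell
# Sketch: print the distinct names of fields that have content in a
# stylesheet output file (for example the FileOut.xml from the command line run).
list_doc_field_names() {
    doc="$1"
    # -o prints only the matching opening tags; '">' excludes empty '"/>' elements.
    grep -o '<field name="[^"]*">' "$doc" \
        | sed 's/<field name="\([^"]*\)">/\1/' \
        | sort -u
}
```

If mods.title appears in this list but not in the index, the stylesheet side is 
fine and the place to look is schema.xml and the Solr server.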

As to your point (2), the only differences between schema.xml in Solr 3.4 and 
in GSearch 2.3 are marked with comments in the one in GSearch 2.3, and that is 
essentially the dynamicField line.

-Gert


On 22/11/2011, at 14.21, Serhiy Polyakov wrote:

> Gert,
> 
> When I took out <dynamicField name="*"… it did not work at all.
> 
> I am observing two things:
> 
> (1)
> I am not getting fields that are extracted from datastreams using
> external functions (mods) or need processing by Solr tools (OBJ
> (application/pdf))
> 
> For MODS, my foxmlToSolr.xslt uses:
> 
> islandora-exts:getXMLDatastreamASNodeList($PID, $REPOSITORYNAME,
> 'MODS', $FEDORASOAP, $FEDORAUSER, $FEDORAPASS, $TRUSTSTOREPATH,
> $TRUSTSTOREPASS)
> 
> Its class (ca/upei/roblib/DataStreamForXSLT.class) is located under
> [FedoraHome]/tomcat/webapps/fedoragsearch/WEB-INF/classes
> 
> Should I let GSearch know about it somehow?
> 
> (2)
> Solr 3.4 must have some way to define fields other than the Solr 1.4 way.
> If you look at schema.xml from Solr 3.4 and schema.xml from GSearch 2.3,
> they do not include any DC fields, for example, yet I am getting all of
> them in my index with those schema.xml files.
> 
> 
> Serhiy
> 
> 
> On Tue, Nov 22, 2011 at 4:34 AM, Serhiy Polyakov <sp0...@gmail.com> wrote:
>> I forgot to mention that I am using Solr 3.4 and Fedora GSearch 2.3. I
>> think I was using the wrong field type, "text". I do not see it defined
>> in schema.xml. However, I tried other types and still got no result. I
>> added just one mods field like this:
>> 
>> <field name="mods.title" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> 
>> It still does not go into the index, even though the output of
>> foxmlToSolr.xslt gives <field name="mods.title">Title 1</field>
>> 
>> 
>> Serhiy
>> 
>> 
>> On Tue, Nov 22, 2011 at 3:07 AM, Serhiy Polyakov <sp0...@gmail.com> wrote:
>>> Gert,
>>> 
>>> I was able to generate output from command line by using downloaded
>>> Xalan and adding class paths. But I have another question below.
>>> 
>>> So my command line looks like this:
>>> java -Xms512m -Xmx1024m -cp \
>>> [FedoraHome]/fedora/tomcat/webapps/fedoragsearch/WEB-INF/classes:\
>>> [FedoraHome]/DISTR_XALAN/xalan/*:\
>>> [FedoraHome]/fedora/tomcat/webapps/fedoragsearch/WEB-INF/lib/*:\
>>> [FedoraHome]/fedora/solr_dir/contrib/extraction/lib/*: \
>>> org.apache.xalan.xslt.Process \
>>> -PARAM FEDORASOAP 'http://localhost:8080/fedora/services' \
>>> -PARAM REPOSITORYNAME 'SomeName' \
>>> -PARAM FEDORAUSER 'fedoraAdmin' \
>>> -PARAM FEDORAPASS 'SomePassword' \
>>> -PARAM TRUSTSTOREPATH '[FedoraHome]/fedora/server/truststore' \
>>> -PARAM TRUSTSTOREPASS 'SomePassword' \
>>> -in [FileIn.xml] \
>>> -xsl foxmlToSolr.xslt \
>>> -out [FileOut.xml]
>>> 
>>> All managed content is getting into FileOut.xml, including the PDF as
>>> text. Here is an excerpt:
>>> <field name="dc.title">Pdf docum</field>
>>> <field name="mods.title">Pdf docum</field>
>>> <field name="dsm.OBJ">extracted content</field>
>>> 
>>> 
>>> Another question. Now I am trying to get the fields into Solr Index.
>>> All fields except mods.* are going there. My steps:
>>> 
>>> (1) Edit foxmlToSolr.xslt so that I am getting all metadata fields I
>>> need in the output (confirmed using command line method above).
>>> 
>>> (2) Edit schema.xml for Solr, adding statements like these:
>>> <copyField source="mods.title" dest="mods.title_s" />
>>> <field name="mods.title" type="text" indexed="true" stored="false"
>>> multiValued="true"/>
>>> <field name="mods.title_s" type="string" maxChars="300" indexed="true"
>>> stored="true"/>
>>> 
>>> After this I stopped Tomcat, deleted the index, started Tomcat, and updated
>>> the index using the Fedora GSearch web admin.
>>> 
>>> There are no MODS fields in the created index (I checked with Luke). All
>>> other fields were created OK, like dc.*, dsm.OCR and others.
>>> 
>>> Do I need to edit any files besides the two above? Any suggestions would help.
>>> 
>>> Thanks,
>>> Serhiy
>>> 
>>> 
>>> 
>>> On Mon, Nov 21, 2011 at 3:48 AM, Gert Schmeltz Pedersen
>>> <g...@dtic.dtu.dk> wrote:
>>>> Hi Serhiy,
>>>> 
>>>> I think that you are missing
>>>> dk.defxws.fedoragsearch.server.GenericOperationsImpl
>>>> and related classes from the classpath, when you run from command line. 
>>>> Let me know how it goes.
>>>> 
>>>> -Gert
>>>> 
>>>> 
>>>> On 21/11/2011, at 10.04, Serhiy Polyakov wrote:
>>>> 
>>>>> At first I did not pass parameters to exts:getDatastreamText.
>>>>> I do now. There is still no OCR text content in the OUT.txt fields.
>>>>> 
>>>>> Serhiy
>>>>> 
>>>>> 
>>>>> On Mon, Nov 21, 2011 at 2:27 AM, Serhiy Polyakov <sp0...@gmail.com> wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I want to use command line to process exported Fedora object using
>>>>>> foxmlToSolr.xslt stylesheet. I need to see the resulting document that
>>>>>> will be used by solr/conf/schema.xml to create index.
>>>>>> 
>>>>>> Object's Foxml includes inline DC datastream and managed (external)
>>>>>> OCR datastream that contains text/plain. Foxml includes reference to
>>>>>> OCR datastream on the local server like
>>>>>> http://localhost:8080/fedora/get/... I pointed browser to the OCR
>>>>>> datastream reference and I see the text there. My FedoraGSearch
>>>>>> indexed DC and OCR alright as a part of regular workflow so
>>>>>> foxmlToSolr.xslt must be correct.
>>>>>> 
>>>>>> However, I need to do the transformation from the command line for the
>>>>>> analysts. I downloaded Xalan and ran:
>>>>>> 
>>>>>> java -cp dk/defxws/fedoragsearch/server:path/to/xalan/*:
>>>>>> org.apache.xalan.xslt.Process -in <SOURCE.xml> -xsl foxmlToSolr.xslt
>>>>>> -out <OUT.txt>
>>>>>> 
>>>>>> Here is an excerpt from OUT.txt:
>>>>>> <field name="dc.title">My Title</field>
>>>>>> <field name="dsm.OCR"/>
>>>>>> 
>>>>>> So it is not grabbing managed content (OCR in my case).
>>>>>> 
>>>>>> foxmlToSolr.xslt includes an external function definition, and I believe
>>>>>> it uses it for managed content:
>>>>>> ======
>>>>>> …
>>>>>> xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
>>>>>> …
>>>>>> <xsl:value-of select="exts:getDatastreamText($PID, $REPOSITORYNAME,
>>>>>> @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS, $TRUSTSTOREPATH,
>>>>>> $TRUSTSTOREPASS)"/>
>>>>>> …
>>>>>> =====
>>>>>> 
>>>>>> Could somebody tell me whether it is at all possible to get managed
>>>>>> content into the output when doing command-line processing?
>>>>>> Again, managed content is getting into the index as part of the regular
>>>>>> FedoraGSearch workflow with the same foxmlToSolr.xslt.
>>>>>> 
>>>>>> Thanks,
>>>>>> Serhiy
>>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> All the data continuously generated in your IT infrastructure
>>>>> contains a definitive record of customers, application performance,
>>>>> security threats, fraudulent activity, and more. Splunk takes this
>>>>> data and makes sense of it. IT sense. And common sense.
>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>> _______________________________________________
>>>>> Fedora-commons-users mailing list
>>>>> Fedora-commons-users@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 


