[jira] Updated: (NUTCH-185) XMLParser is configurable xml parser plugin.

Rida Benjelloun (JIRA) Wed, 01 Feb 2006 12:11:51 -0800

     [ http://issues.apache.org/jira/browse/NUTCH-185?page=all ]


Rida Benjelloun updated NUTCH-185:
----------------------------------

        Summary: XMLParser is configurable xml parser plugin.   (was: XMLParser 
is configurable plugin. It use XPath and namespaces to do the mapping between 
the XML elements and Lucene fields.)
    Description: 
The XML parser is a configurable plugin. It uses XPath and namespaces to map 
XML elements to Lucene fields. 

Information:

1- Copy "xmlparser-conf.xml" to the nutch/conf dir

2- To index your custom XML files, modify "xmlparser-conf.xml". 
The parser uses namespaces and XPath to parse XML content. 
The config file maps XML nodes (selected with XPath) to Lucene fields. 
Example: <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 

3- Each xmlIndexerProperties element groups a set of fields associated with a 
namespace. 
If that namespace is found in the XML document, the fields defined for the 
namespace are indexed.
Example: 
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
  <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
  <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" /> 
</xmlIndexerProperties>


4- A default namespace can be defined. It is applied when the parser finds 
no namespace in the document, or when the namespace found in the XML 
document does not match a namespace defined in an xmlIndexerProperties 
element. 
Example:
<xmlIndexerProperties type="filePerDocument" namespace="default">
  <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" /> 
</xmlIndexerProperties>
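
The selection logic described in points 2 to 4 can be sketched as follows. 
This is only an illustrative Python sketch, not the plugin's Java code: the 
CONFIG dictionary, the "prefix" entry and the function names are assumptions 
made for the example, and the XPath expressions are written as ".//x" 
because Python's ElementTree only accepts relative paths.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical in-memory form of xmlparser-conf.xml (an assumption, not the
# plugin's real data structure): one entry per <xmlIndexerProperties>,
# keyed by its namespace attribute. Each field is
# (name, xpath, type, boost), mirroring the <field> attributes.
CONFIG = {
    DC_NS: {
        "prefix": "dc",  # assumed binding of the "dc" prefix used in the XPath
        "fields": [
            ("dctitle", ".//dc:title", "Text", 1.4),
            ("dccreator", ".//dc:creator", "keyword", 1.0),
        ],
    },
    "default": {
        "prefix": None,
        "fields": [("xmlcontent", ".//*", "Unstored", 1.0)],
    },
}


def used_namespaces(root):
    """Return the namespace URIs that appear in element tags."""
    uris = set()
    for elem in root.iter():
        # ElementTree stores qualified tags as "{uri}localname".
        if elem.tag.startswith("{"):
            uris.add(elem.tag[1:].split("}", 1)[0])
    return uris


def extract_fields(xml_text, config=CONFIG):
    """Return (name, value, type, boost) tuples for one XML document."""
    root = ET.fromstring(xml_text)
    present = used_namespaces(root)

    # Point 3: use every field set whose namespace occurs in the document.
    selected = [ns for ns in config if ns != "default" and ns in present]
    # Point 4: fall back to the default set when nothing matched.
    if not selected and "default" in config:
        selected = ["default"]

    results = []
    for ns in selected:
        rule_set = config[ns]
        prefix_map = {rule_set["prefix"]: ns} if rule_set["prefix"] else None
        for name, xpath, field_type, boost in rule_set["fields"]:
            for elem in root.findall(xpath, prefix_map):
                text = (elem.text or "").strip()
                if text:  # skip elements without direct text content
                    results.append((name, text, field_type, boost))
    return results


if __name__ == "__main__":
    sample = (
        '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>Nutch in Practice</dc:title>"
        "<dc:creator>Jane Doe</dc:creator>"
        "</record>"
    )
    for field in extract_fields(sample):
        print(field)
```

A document that uses the Dublin Core namespace is handled by the first field 
set; any other document falls through to the "default" set, which collects 
the text of every descendant element into "xmlcontent".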


  was:
XMLParser is configurable plugin. It use XPath and namespaces to do the mapping 
between the XML elements and Lucene fields. 

Informations :

1- Copy "xmlparser-conf.xml" to the nutch/conf dir

2- To index your custom XML file, you have to modify the "xmlparser-conf.xml". 
This parser uses namespaces and XPATH to parse XML content
The config file do the mapping between the XML noeds (using XPATH) and lucene 
field. 
Example : <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 

3- The xmlIndexerProperties encapsulate a set of fields associated to a 
namespace. 
If the namespace is found in the xml document, the fields represented by the 
namespace will be indexed.
Example : 
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
  <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
  <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" /> 
</xmlIndexerProperties>


4- It is possible to define a default namespace that will be applied when the 
parser 
didn't find any namespace in the document or when the namespace found in the 
xml document doesn't match with the namespace defined in the 
xmlIndexerProperties. 
Example :
<xmlIndexerProperties type="filePerDocument" namespace="default">
  <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" /> 
</xmlIndexerProperties>



> XMLParser is configurable xml parser plugin. 
> ---------------------------------------------
>
>          Key: NUTCH-185
>          URL: http://issues.apache.org/jira/browse/NUTCH-185
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher, indexer
>     Versions: 0.7.2-dev
>  Environment: OS Independent
>     Reporter: Rida Benjelloun
>  Attachments: parse-xml.zip
>
> The XML parser is a configurable plugin. It uses XPath and namespaces to map 
> XML elements to Lucene fields. 
> Information:
> 1- Copy "xmlparser-conf.xml" to the nutch/conf dir
> 2- To index your custom XML files, modify "xmlparser-conf.xml". 
> The parser uses namespaces and XPath to parse XML content. 
> The config file maps XML nodes (selected with XPath) to Lucene fields. 
> Example: <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
> 3- Each xmlIndexerProperties element groups a set of fields associated with a 
> namespace. 
> If that namespace is found in the XML document, the fields defined for the 
> namespace are indexed.
> Example: 
> <xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
>   <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
>   <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" /> 
> </xmlIndexerProperties>
> 4- A default namespace can be defined. It is applied when the parser finds 
> no namespace in the document, or when the namespace found in the XML 
> document does not match a namespace defined in an xmlIndexerProperties 
> element. 
> Example:
> <xmlIndexerProperties type="filePerDocument" namespace="default">
>   <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" /> 
> </xmlIndexerProperties>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
