Markus Klose created SOLR-3976:
----------------------------------
Summary: HTMLStripTransformer strips the "tika" field not the
field to index -> cannot have both (stripped and unstripped)
Key: SOLR-3976
URL: https://issues.apache.org/jira/browse/SOLR-3976
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.6
Reporter: Markus Klose
Priority: Minor
I got an unexpected result when indexing an HTML file with the
DataImportHandler. I wanted to create one field with the original
content and one field with the same content but without HTML markup.
If I enable the HTMLStripTransformer on field text2, the other field (text1) is
stripped as well.
Example configuration:
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            recursive="true" rootEntity="false"
            dataSource="null" baseDir="...." fileName=".*.html"
            onError="skip">
      <entity name="tika-test"
              processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
              format="html" dataSource="bin" onError="skip"
              transformer="HTMLStripTransformer,TemplateTransformer">
        <field column="id" template="${f.file}"/>
        <field column="text" name="text1"/>
        <field column="text" name="text2"
               stripHTML="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
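
A possible workaround, offered only as an untested sketch: both field definitions above read and write the same row column ("text"), so stripping it changes the value that text1 receives too. The fragment below assumes that TemplateTransformer can first copy the Tika output into a separate column (named "text2" here, a name chosen for illustration) and that HTMLStripTransformer, running second, then strips only that copy. The transformer order and the ${tika-test.text} variable are assumptions, not confirmed behavior.

```xml
<entity name="tika-test"
        processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
        format="html" dataSource="bin" onError="skip"
        transformer="TemplateTransformer,HTMLStripTransformer">
  <field column="id" template="${f.file}"/>
  <!-- original Tika output, markup left untouched -->
  <field column="text" name="text1"/>
  <!-- assumption: copy "text" into its own column "text2" first,
       then strip HTML from that column only -->
  <field column="text2" template="${tika-test.text}" stripHTML="true"/>
</entity>
```

If this works, text1 keeps the markup and text2 holds the stripped copy. It sidesteps the reported behavior rather than fixing it.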