[
https://issues.apache.org/jira/browse/SOLR-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481356#comment-13481356
]
Markus Klose edited comment on SOLR-3976 at 10/22/12 1:35 PM:
--------------------------------------------------------------
If this sounds like "help me to index an HTML file", I am sorry. I just thought
that this is a bug and should be posted here. Please close if necessary.
We created a workaround with a sub-entity like:
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            recursive="true" rootEntity="false"
            dataSource="null" baseDir="..." fileName=".*.html"
            onError="skip" transformer="TemplateTransformer">
      <entity name="tika-test"
              processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
              format="html" dataSource="bin" onError="skip"
              transformer="TemplateTransformer,RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
        <field column="id" template="${f.file}"/>
        <field column="text" name="text1"/>
        <entity name="tika2"
                processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
                format="html" dataSource="bin"
                onError="skip" transformer="TemplateTransformer,HTMLStripTransformer">
          <field column="text" name="text2" stripHTML="true"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
was (Author: markus-klose):
If this sounds like "help me to index an HTML file", I am sorry. I just thought
that this is a bug and should be posted here. Please close if necessary.
We created a workaround with a sub-entity like:
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            recursive="true" rootEntity="false"
            dataSource="null" baseDir="..." fileName=".*.html"
            onError="skip" transformer="TemplateTransformer">
      <entity name="tika-test"
              processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
              format="html" dataSource="bin" onError="skip"
              transformer="TemplateTransformer,RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
        <field column="id" template="${f.file}"/>
        <field column="text" name="text1"/>
        <entity name="tika2"
                processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
                format="html" dataSource="bin"
                onError="skip" transformer="TemplateTransformer,HTMLStripTransformer">
          <field column="text" name="text2" stripHTML="false"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
> HTMLStripTransformer strips the "tika" field not the field to index -> cannot
> have both (stripped and unstripped)
> -----------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-3976
> URL: https://issues.apache.org/jira/browse/SOLR-3976
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler
> Affects Versions: 3.6
> Reporter: Markus Klose
> Priority: Minor
>
> I ran into a situation where indexing an HTML file with the DataImportHandler
> produced unexpected output. I wanted to create one field with the original
> content and one field with the same content but without HTML markup.
> If I enable the HTMLStripTransformer on field text2, the other field (text1) is
> stripped as well.
> Example configuration:
> <dataConfig>
>   <dataSource type="BinFileDataSource" name="bin"/>
>   <document>
>     <entity name="f" processor="FileListEntityProcessor"
>             recursive="true" rootEntity="false"
>             dataSource="null" baseDir="...." fileName=".*.html"
>             onError="skip" >
>
>       <entity name="tika-test"
>               processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
>               format="html" dataSource="bin" onError="skip"
>               transformer="HTMLStripTransformer,TemplateTransformer">
>
>         <field column="id" template="${f.file}"/>
>
>         <field column="text" name="text1" />
>         <field column="text" name="text2" stripHTML="true"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
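
The behaviour described above can be modelled in a few lines. This is only a
sketch, not DataImportHandler code: it assumes the transformer writes the
stripped text back under the source column name ("text") rather than under the
target field name, which is what the issue title suggests. The helper names
(strip_html, html_strip_transformer, map_fields) are hypothetical, and the
regex is a crude stand-in for the real HTML stripping.

```python
import re


def strip_html(value):
    # Crude stand-in for real HTML stripping; enough for this sketch.
    return re.sub(r"<[^>]+>", "", value)


def html_strip_transformer(row, fields):
    # Assumed behaviour: for each field flagged stripHTML="true", the
    # stripped value replaces the value stored under the *column* key.
    for field in fields:
        if field.get("stripHTML") == "true":
            col = field["column"]
            row[col] = strip_html(row[col])
    return row


def map_fields(row, fields):
    # Every target field reads its value from the shared source column.
    return {f["name"]: row[f["column"]] for f in fields if "name" in f}


fields = [
    {"column": "text", "name": "text1"},
    {"column": "text", "name": "text2", "stripHTML": "true"},
]
row = {"text": "<p>Hello <b>world</b></p>"}

doc = map_fields(html_strip_transformer(row, fields), fields)
print(doc)  # both text1 and text2 come out stripped
```

Under that assumption, text1 and text2 can never differ while they share one
row, which would explain why giving text2 its own sub-entity (and thus its own
row) works around the problem.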