[
https://issues.apache.org/jira/browse/SOLR-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290131#comment-14290131
]
Steve Rowe edited comment on SOLR-6856 at 1/23/15 10:36 PM:
------------------------------------------------------------
I noticed when using {{IdentityHtmlMapper}} that the {{<br/>}} tag causes the
string "none\n" to show up in the catch-all {{content}} field Tika produces,
e.g. for
{code:xml}
<p>distinct<br/>words</p>
{code}
the following is extracted in the catch-all field:
{noformat}
distinctnone
words
{noformat}
I suspect this is a Tika bug, but I didn't track down why this happens.
I addressed this problem by copy/pasting {{IdentityHtmlMapper}} into a nested
class of {{ExtractingDocumentLoader}} and overriding {{mapSafeElement(String
name)}} to return {{null}} when {{name}} is {{br}} - this causes Tika to not
output the "none\n" string in the catch-all field.
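For illustration only, the override described above might look roughly like the sketch below. It is not the code from the attached patch. {{ElementMapper}}, {{IdentityMapper}} and {{BrSkippingMapper}} are hypothetical stand-ins so the example compiles without Tika on the classpath; the real interface is Tika's {{HtmlMapper}}, which has additional methods, and the lower-casing in the identity mapping is an assumption. The sketch also subclasses rather than copy/pasting, purely to keep it short.

```java
import java.util.Locale;

/**
 * Sketch only: ElementMapper is a hypothetical, simplified stand-in for
 * Tika's HtmlMapper so that this example has no external dependencies.
 */
final class BrSkippingMapperSketch {

    /** Hypothetical stand-in for the element-mapping part of an HTML mapper. */
    interface ElementMapper {
        /** Returns the element name to emit, or null if the element is not mapped. */
        String mapSafeElement(String name);
    }

    /** Identity-style mapping: every element name passes through (lower-cased, by assumption). */
    static class IdentityMapper implements ElementMapper {
        @Override
        public String mapSafeElement(String name) {
            return name.toLowerCase(Locale.ENGLISH);
        }
    }

    /** Same as the identity mapping, except that "br" maps to null. */
    static class BrSkippingMapper extends IdentityMapper {
        @Override
        public String mapSafeElement(String name) {
            String mapped = super.mapSafeElement(name);
            return "br".equals(mapped) ? null : mapped;
        }
    }

    public static void main(String[] args) {
        ElementMapper mapper = new BrSkippingMapper();
        // Print how each sample tag name is mapped.
        for (String tag : new String[] {"p", "BR", "div"}) {
            System.out.println(tag + " -> " + mapper.mapSafeElement(tag));
        }
    }
}
```

In this sketch, returning {{null}} for {{br}} is the only difference from the identity mapping; all other element names are still passed through.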
I noticed that {{DefaultHtmlMapper}} excludes {{<SCRIPT>}} and {{<STYLE>}} tags
and their content, while {{IdentityHtmlMapper}} does not. I thought I would
need to also address this, because a general-purpose HTML extraction facility
should not include that {{<SCRIPT>}} or {{<STYLE>}} content. But apparently
Tika handles exclusion of these tags and their content at some location other
than the {{HtmlMapper}} - even when using {{IdentityHtmlMapper}}, no
start-element events are created for these tags. Nevertheless, I added a test
to make sure that {{<SCRIPT>}} and {{<STYLE>}} content is not extracted.
I added a new test to more fully demonstrate that {{xpath}} handling works
properly.
I think it's ready to go. I'm running all Solr tests now.
> regression in /update/extract ? ref guide examples of fmap & xpath don't seem
> to be working
> --------------------------------------------------------------------------------------------
>
> Key: SOLR-6856
> URL: https://issues.apache.org/jira/browse/SOLR-6856
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.1
> Reporter: Hoss Man
> Priority: Blocker
> Attachments: SOLR-6856.patch, SOLR-6856.patch
>
>
> I updated this page to know about the new bin/solr and example/exampledocs
> structure/contents...
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> however i noticed that several of the examples listed on that page didn't
> seem to work any more -- notably...
> * examples using "fmap" don't seem to create the fields they say they will
> * examples using "xpath" don't seem to create any docs at all
> Specific examples i had problems with...
> {noformat}
> curl
> "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div&commit=true"
> -F "sample=@example/exampledocs/sample.html"
> curl
> "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&commit=true"
> -F "sample=@example/exampledocs/sample.html"
> curl
> "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah&commit=true"
> -F "sample=@example/exampledocs/sample.html"
> curl
> "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&commit=true"
> -F "sample=@example/exampledocs/sample.html"
> {noformat}
> ...none of these example commands produced an error, but they also didn't
> seem to create the fields/docs they said they would (ie: no "foo_t" field was
> created)