[jira] [Comment Edited] (SOLR-6856) regression in /update/extract ? ref guide examples of fmap & xpath don't seem to be working

Steve Rowe (JIRA) Fri, 23 Jan 2015 14:36:54 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290131#comment-14290131
 ]


Steve Rowe edited comment on SOLR-6856 at 1/23/15 10:36 PM:
------------------------------------------------------------

I noticed when using {{IdentityHtmlMapper}} that the {{<br/>}} tag causes the 
string "none\n" to show up in the catch-all {{content}} field Tika produces, 
e.g. for 

{code:xml}
<p>distinct<br/>words</p>
{code}

the following is extracted in the catch-all field:

{noformat}
 distinctnone
words
{noformat}

I suspect this is a Tika bug, but I didn't track down why this happens.

I addressed this problem by copy/pasting {{IdentityHtmlMapper}} into a nested 
class of {{ExtractingDocumentLoader}} and overriding {{mapSafeElement(String 
name)}} to return {{null}} when {{name}} is {{br}} - this causes Tika to not 
output the "none\n" string in the catch-all field.
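
The override described above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patch: {{HtmlMapperSketch}} and {{IdentityMapperSketch}} are hypothetical stand-ins modelled on Tika's {{HtmlMapper}} and {{IdentityHtmlMapper}} so that the example compiles without Tika on the classpath, and the case-insensitive match on {{br}} is an assumption.

```java
import java.util.Locale;

// Hypothetical demo class; in the comment above the mapper is nested
// inside ExtractingDocumentLoader instead.
final class BrMapperDemo {

    // Stand-in for the shape of Tika's HtmlMapper (assumed, declared here
    // only so the sketch is self-contained).
    interface HtmlMapperSketch {
        String mapSafeElement(String name);
        boolean isDiscardElement(String name);
        String mapSafeAttribute(String elementName, String attributeName);
    }

    // Identity-style mapper: every element and attribute passes through, lower-cased.
    static class IdentityMapperSketch implements HtmlMapperSketch {
        @Override
        public String mapSafeElement(String name) {
            return name.toLowerCase(Locale.ENGLISH);
        }

        @Override
        public boolean isDiscardElement(String name) {
            return false;
        }

        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            return attributeName.toLowerCase(Locale.ENGLISH);
        }
    }

    // Same behaviour, except <br> is reported as "not safe to map" by
    // returning null, so no element event is produced for it.
    static class BrSkippingMapper extends IdentityMapperSketch {
        @Override
        public String mapSafeElement(String name) {
            String lowered = super.mapSafeElement(name);
            return "br".equals(lowered) ? null : lowered;
        }
    }

    public static void main(String[] args) {
        HtmlMapperSketch mapper = new BrSkippingMapper();
        System.out.println(mapper.mapSafeElement("P"));  // mapped through as "p"
        System.out.println(mapper.mapSafeElement("BR")); // null: <br> is skipped
    }
}
```

Subclassing is used here only to keep the sketch short; the comment above describes copying the mapper into a nested class rather than extending it.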

I noticed that {{DefaultHtmlMapper}} excludes {{<SCRIPT>}} and {{<STYLE>}} tags 
and their content, while {{IdentityHtmlMapper}} does not.  I thought I would 
need to also address this, because a general-purpose HTML extraction facility 
should not include that {{<SCRIPT>}} or {{<STYLE>}} content.  But apparently 
Tika handles exclusion of these tags and their content at some location other 
than the {{HtmlMapper}} - even when using {{IdentityHtmlMapper}}, no 
start-element events are created for these tags.  Nevertheless, I added a test 
to make sure that {{<SCRIPT>}} and {{<STYLE>}} content is not extracted.

I added a new test to more fully demonstrate that {{xpath}} handling works 
properly.

I think it's ready to go. I'm running all Solr tests now.


was (Author: steve_rowe):
I noticed when using {{DefaultHtmlMapper}} that the {{<br/>}} tag causes the 
string "none\n" to show up in the catch-all {{content}} field Tika produces, 
e.g. for 

{code:xml}
<p>distinct<br/>words</p>
{code}

the following is extracted in the catch-all field:

{noformat}
 distinctnone
words
{noformat}

I suspect this is a Tika bug, but I didn't track down why this happens.

I addressed this problem by copy/pasting {{IdentityHtmlMapper}} into a nested 
class of {{ExtractingDocumentLoader}} and overriding {{mapSafeElement(String 
name)}} to return {{null}} when {{name}} is {{br}} - this causes Tika to not 
output the "none\n" string in the catch-all field.

I noticed that {{DefaultHtmlMapper}} excludes {{<SCRIPT>}} and {{<STYLE>}} tags 
and their content, while {{IdentityHtmlMapper}} does not.  I thought I would 
need to also address this, because a general-purpose HTML extraction facility 
should not include that {{<SCRIPT>}} or {{<STYLE>}} content.  But apparently 
Tika handles exclusion of these tags and their content at some location other 
than the {{HtmlMapper}} - even when using {{IdentityHtmlMapper}}, no 
start-element events are created for these tags.  Nevertheless, I added a test 
to make sure that {{<SCRIPT>}} and {{<STYLE>}} content is not extracted.

I added a new test to more fully demonstrate that {{xpath}} handling works 
properly.

I think it's ready to go. I'm running all Solr tests now.

> regression in /update/extract ? ref guide examples of fmap & xpath don't seem 
> to be working 
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6856
>                 URL: https://issues.apache.org/jira/browse/SOLR-6856
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 3.1
>            Reporter: Hoss Man
>            Priority: Blocker
>         Attachments: SOLR-6856.patch, SOLR-6856.patch
>
>
> I updated this page to know about the new bin/solr and example/exampledocs 
> structure/contents...
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> however i noticed that several of the examples listed on that page didn't 
> seem to work any more -- notably...
> * examples using "fmap" don't seem to create the fields they say they will
> * examples using "xpath" don't seem to create any docs at all
> Specific examples i had problems with...
> {noformat}
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> {noformat}
> ...none of these example commands produced an error, but they also didn't 
> seem to create the fields/docs they said they would (ie: no "foo_t" field was 
> created)
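
To make the parameter combination in the first curl command easier to read, here is a small sketch that assembles the same query string from named parameters. The class and method names ({{ExtractUrlSketch}}, {{buildExtractUrl}}, {{firstExampleParams}}) are hypothetical; only the URL and the parameter names and values come from the commands quoted above. It assumes Java 10 or later for {{URLEncoder.encode(String, Charset)}}.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr or SolrJ.
final class ExtractUrlSketch {

    static final String BASE = "http://localhost:8983/solr/techproducts/update/extract";

    // Joins URL-encoded key=value pairs onto the base URL, keeping map order.
    static String buildExtractUrl(String base, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return base + "?" + query;
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Parameters of the first (doc2) example, in the order given in the report.
    static Map<String, String> firstExampleParams() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("literal.id", "doc2");   // literal value for the id field
        p.put("captureAttr", "true");
        p.put("defaultField", "text");
        p.put("fmap.div", "foo_t");    // the report expects "div" content to land in foo_t
        p.put("capture", "div");       // capture <div> content separately
        p.put("commit", "true");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(buildExtractUrl(BASE, firstExampleParams()));
    }
}
```

The sketch only builds the URL; posting {{sample.html}} as multipart form data, as {{curl -F}} does, is left out.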


