[ 
https://jira.nuxeo.org/browse/NXP-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Grisel resolved NXP-5850.
---------------------------------

    Resolution: Fixed

http://hg.nuxeo.org/nuxeo/nuxeo-core/rev/597f70a806ef

> Make the HTML to text converter keep the punctuation and some basic layout 
> information.
> ---------------------------------------------------------------------------------------
>
>                 Key: NXP-5850
>                 URL: https://jira.nuxeo.org/browse/NXP-5850
>             Project: Nuxeo Enterprise Platform
>          Issue Type: Bug
>          Components: Transforms / Preview
>    Affects Versions: 5.3.2
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>            Priority: Major
>             Fix For: 5.4
>
>
> Most *_to_text transformer / converter implementations (office files and PDF) 
> extract a text representation that preserves the punctuation and overall 
> layout of the document.
> However, this is not the case for the HTML extractor. This is a problem for 
> semantic analysis of Note documents, since most analysers require access to 
> the punctuation to detect sentence boundaries. Furthermore, the current 
> implementation extracts the following markup:
> """
> <h1>This is the title</h1><p>This is the paragraph.</p>
> """
> as:
> """
> This is the titleThis is the paragraph
> """
> This makes the fulltext indexer extract the single token "titlethis" instead 
> of the two tokens "title" and "this".
> Ideally this should be extracted as:
> """
> This is the title
> This is the paragraph.
> """
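
The behaviour requested above can be illustrated with a minimal sketch. It is
not the code from the referenced changeset: it is written in Python with the
standard library's html.parser module, and the names BLOCK_TAGS,
LayoutPreservingExtractor, html_to_text and tokens are hypothetical, chosen
only for this illustration. The idea is to keep character data untouched (so
punctuation survives) and to emit a line break around block-level elements so
that adjacent words are never glued together.

```python
import re
from html.parser import HTMLParser

# Assumption: an illustrative set of block-level tags, not the list used by
# the actual Nuxeo converter.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6",
              "p", "div", "li", "tr", "br", "blockquote"}


class LayoutPreservingExtractor(HTMLParser):
    """Collects text content and inserts a newline around block elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Character data is kept as-is, including punctuation.
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment, one line per block element."""
    parser = LayoutPreservingExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Normalise whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def tokens(text):
    """Very rough stand-in for a fulltext tokenizer (hypothetical)."""
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    sample = "<h1>This is the title</h1><p>This is the paragraph.</p>"
    print(html_to_text(sample))
    print(tokens(html_to_text(sample)))
```

With the sample markup from the description, the title and the paragraph end
up on separate lines, the final full stop is preserved, and the rough
tokenizer yields "title" and "this" as separate tokens.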

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://jira.nuxeo.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
_______________________________________________
ECM-tickets mailing list
[email protected]
http://lists.nuxeo.com/mailman/listinfo/ecm-tickets
