Make the HTML to text converter keep the punctuation and some basic layout 
information.
---------------------------------------------------------------------------------------

                 Key: NXP-5850
                 URL: https://jira.nuxeo.org/browse/NXP-5850
             Project: Nuxeo Enterprise Platform
          Issue Type: Bug
          Components: Transforms / Preview
    Affects Versions: 5.3.2
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel
            Priority: Major
             Fix For: 5.4


Most *_to_text transformer / converter implementations (office files and PDF) 
extract a text representation that preserves the punctuation and overall layout 
of the document.

However, this is not the case for the HTML extractor. This is a problem for 
semantic analysis of Note documents, since most analysers require access to the 
punctuation to detect sentence boundaries. Furthermore, the current 
implementation extracts the following markup:

"""
<h1>This is the title</h1><p>This is the paragraph.</p>
"""

as:

"""
This is the titleThis is the paragraph
"""

This makes the fulltext indexer extract the token "titlethis" instead of 
"title" and "this".

Ideally this should be extracted as:

"""
This is the title

This is the paragraph.
"""

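The expected behaviour above can be illustrated with a minimal sketch. It is 
written in Python with the standard library only, purely to show the idea; it 
is not the Nuxeo converter code. The names BLOCK_TAGS, 
LayoutPreservingExtractor and html_to_text are hypothetical, and the list of 
block-level tags is an assumption. Text nodes are kept verbatim (so punctuation 
survives) and a blank line is emitted at each block-element boundary.

```python
import re
from html.parser import HTMLParser

# Assumption: tags treated as block-level, i.e. producing a paragraph break.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6",
              "p", "div", "li", "tr", "blockquote", "pre"}


class LayoutPreservingExtractor(HTMLParser):
    """Collects text nodes and marks block boundaries with blank lines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._chunks.append("\n")
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._chunks.append("\n\n")

    def handle_data(self, data):
        # Text is kept as-is, including punctuation.
        # Note: this sketch does not skip <script> or <style> content.
        self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse horizontal whitespace on each line and trim the line.
        lines = [re.sub(r"[ \t\r\f\v]+", " ", line).strip()
                 for line in raw.split("\n")]
        # Reduce runs of three or more newlines to a single blank line.
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def html_to_text(html):
    parser = LayoutPreservingExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(html_to_text(
        "<h1>This is the title</h1><p>This is the paragraph.</p>"))
```

With the markup from this report, the sketch keeps the title and the paragraph 
in separate blocks and keeps the final period, so a tokenizer sees "title" and 
"This" as distinct words.
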

        