[ 
https://issues.apache.org/jira/browse/TIKA-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Kincaid updated TIKA-2410:
-------------------------------
    Description: 
While parsing some RTF files, I'm finding that the RTF parser tags many text 
spans as bold even though they are not. I am attaching a sample RTF file that 
exhibits this behavior. When parsing the file, the first line is correctly 
tagged as bold. However, the second line (the phone number), which is not 
supposed to be bold, is also tagged as bold.

The following code demonstrates the problem.

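{code:java}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.rtf.RTFParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// Load the attached sample from the classpath.
InputStream inputStream = Thread.currentThread().getContextClassLoader()
        .getResourceAsStream("sample-rtf.rtf");

Parser parser = new RTFParser();
ContentHandler contentHandler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

parser.parse(inputStream, contentHandler, metadata, context);
String xml = contentHandler.toString();
{code}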
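
One way to check the symptom programmatically is to list the text that ends up inside bold elements in the resulting XHTML. The following is only a sketch: it assumes bold spans are emitted as `<b>` elements, the `BoldSpans` class and its `boldText` method are hypothetical names, and the input string in `main` is a made-up stand-in rather than actual parser output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
public class BoldSpans {

    /** Returns the text of each outermost <b> span found in the given XHTML string. */
    public static List<String> boldText(String xhtml) throws Exception {
        List<String> spans = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // nesting depth of <b> elements
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("b".equals(qName)) {
                    depth++;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Collect text only while inside at least one <b> element.
                if (depth > 0) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                // Emit the collected text when the outermost <b> closes.
                if ("b".equals(qName) && --depth == 0) {
                    spans.add(text.toString());
                    text.setLength(0);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return spans;
    }

    public static void main(String[] args) throws Exception {
        // Made-up stand-in for the `xml` string produced by the parser.
        String xml = "<body><p><b>Heading</b></p><p>555-0100</p></body>";
        System.out.println(boldText(xml)); // prints [Heading]
    }
}
```

If the second line (the phone number) shows up in the returned list when the real output is passed in, the parser has tagged it as bold.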


  was:
While parsing some RTF files, I'm finding that the RTF parser tags many text 
spans as bold even though they are not. I am attaching a sample RTF file that 
exhibits this behavior. When parsing the file, the first line is correctly 
tagged as bold. However, the second line (the phone number), which is not 
supposed to be bold, is also tagged as bold.

The following code demonstrates the problem.

{code:java}
        InputStream inputStream = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("sample-rtf.rtf");

        Parser parser = new RTFParser();
        ContentHandler contentHandler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        parser.parse(inputStream, contentHandler, metadata, context);
        String xml = contentHandler.toString();
{code}



> RTF parser is tagging non-bold text as bold
> -------------------------------------------
>
>                 Key: TIKA-2410
>                 URL: https://issues.apache.org/jira/browse/TIKA-2410
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Dave Kincaid
>         Attachments: sample-rtf.rtf
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
