[ 
https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138824#comment-16138824
 ] 

ASF GitHub Bot commented on ANY23-280:
--------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/24#discussion_r134831365
  
    --- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
     
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor<InputStream> {
    --- End diff --
    
    @jgrzebyta yes, this is correct. We do not always want to assume that the 
input is structured as XML or a subset thereof; syntax-strict extractors are 
prone to breakage. Our aim in Any23 should be to provide flexible extraction 
logic rather than strict, fragile extraction logic.
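
To make the fragility concrete, here is a minimal sketch using only the JDK's 
XML parser. It is not Any23 code: the class and method names 
(StrictParsingSketch, parsesStrictly) are hypothetical, and it only shows that 
a syntax-strict parser rejects a raw stream outright when a single tag is left 
unclosed, which is the kind of input a more tolerant model is meant to survive.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this class is not part of Any23.
final class StrictParsingSketch {

    /** Returns true if the markup is accepted by a strict XML parser, false otherwise. */
    static boolean parsesStrictly(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors but keeps them off stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Not well-formed: the strict parser gives up on the whole input.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesStrictly("<p><b>ok</b></p>"));   // true
        System.out.println(parsesStrictly("<p><b>broken</p>"));   // false
    }
}
```

An extractor built directly on the raw stream inherits this all-or-nothing 
behaviour, whereas one that receives an already repaired DOM would not see the 
failure at all.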


> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
>                 Key: ANY23-280
>                 URL: https://issues.apache.org/jira/browse/ANY23-280
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 2.1
>
>
> As discussed on ANY23-247, the 
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
>  is simply not fit for purpose. This issue was discovered and the cause has 
> plagued our builds ever since. Any extractors which implement 
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
>  are based on the Extractor.ContentExtractor and hence work off of an 
> 'unfixed' raw data stream as opposed to a more flexible model such as the 
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to 
> avoid issues we encounter with the strict SAX parsing logic.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
