[ https://issues.apache.org/jira/browse/ANY23-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302682#comment-14302682 ]

Peter Ansell commented on ANY23-248:
------------------------------------

Thanks for looking further into this, Souri. If that was the fix, then the problem 
is definitely related to classpath searching, rather than libraries missing from 
the Maven dependencies.

It is possible that, in the context of Hadoop, Sesame is using a classloader 
inside the Rio.createParser method that does not have access to Semargl. 
Currently it uses the class's own classloader, which may not have a view of the 
Semargl jar file/classes in the Hadoop classloader model.

The parsers are found starting at the code here:

https://bitbucket.org/openrdf/sesame/src/6275c3e0d504df76edb16396c11e67f07c72439c/core/rio/api/src/main/java/org/openrdf/rio/RDFParserRegistry.java?at=2.7.x

Internally, the constructor for that class gets down to the following code, 
which ends up calling RDFParser.class.getClassLoader(), and that classloader may 
not be useful in your case:

https://bitbucket.org/openrdf/sesame/src/6275c3e0d504df76edb16396c11e67f07c72439c/core/util/src/main/java/info/aduna/lang/service/ServiceRegistry.java?at=2.7.x#cl-45
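
For reference, the lookup there is roughly equivalent to the sketch below (a 
simplification, not the actual Sesame source). The important detail is the 
second argument to ServiceLoader.load(): the classloader that loaded the service 
interface itself, which under Hadoop may be a parent classloader that cannot see 
the jars shipped with the job, so Semargl's META-INF/services entries are never 
found.

import java.util.ServiceLoader;

import org.openrdf.rio.RDFParserFactory;

public class ParserDiscoverySketch {
    public static void listDiscoveredParsers() {
        // Ask the interface's own classloader for implementations, which is
        // approximately what Sesame's ServiceRegistry does when the registry
        // is initialised.
        ServiceLoader<RDFParserFactory> loader = ServiceLoader.load(
                RDFParserFactory.class, RDFParserFactory.class.getClassLoader());
        for (RDFParserFactory factory : loader) {
            System.out.println("Discovered parser for: " + factory.getRDFFormat());
        }
    }
}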

If there is a more durable method for that lookup that works on Hadoop, could 
you submit a pull request to the Sesame Bitbucket repository and I will review 
it there. In particular, it may be possible to use the thread context 
classloader, which may have more classes in scope at that point, but you would 
need to do some experimenting on Hadoop to see which classes it is able to find.
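
As a starting point for that experimenting, here is a small probe you could call 
from inside a Hadoop task to compare what the two classloaders can see (an 
illustrative sketch; pass it the fully-qualified name of whichever Semargl or 
Sesame class fails to load):

public class ClassLoaderProbe {

    // Attempt to resolve the given class with both the defining classloader of
    // this class and the thread context classloader, and report which succeeds.
    public static void probe(String className) {
        tryLoad(className, "own classloader", ClassLoaderProbe.class.getClassLoader());
        tryLoad(className, "thread context classloader",
                Thread.currentThread().getContextClassLoader());
    }

    private static void tryLoad(String className, String label, ClassLoader cl) {
        try {
            Class.forName(className, false, cl);
            System.out.println(label + ": found " + className);
        } catch (ClassNotFoundException e) {
            System.out.println(label + ": cannot find " + className);
        }
    }
}

If the thread context classloader finds the Semargl classes while the class's 
own classloader does not, that would confirm the theory above and point to the 
fix to propose upstream.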

> NTriplesWriter on hadoop : issue with MIME type
> -----------------------------------------------
>
>                 Key: ANY23-248
>                 URL: https://issues.apache.org/jira/browse/ANY23-248
>             Project: Apache Any23
>          Issue Type: Bug
>    Affects Versions: 1.1
>         Environment: hadoop,linux
>            Reporter: Souri
>            Priority: Minor
>             Fix For: 1.2
>
>
> I am trying to create N-Triples from an HTML string. I am using the following
> code to do it:
> StringDocumentSource documentSource = new StringDocumentSource(html, null);
> ByteArrayOutputStream out = new ByteArrayOutputStream();
> final NTriplesWriter tripleHandler = new NTriplesWriter(out);
> Any23 runner = new Any23();
> runner.extract(documentSource, tripleHandler);
> tripleHandler.close();
> String result = out.toString("us-ascii");
> return result;
> This is giving me the error :
> java.lang.NullPointerException
>       at 
> org.apache.any23.extractor.SingleDocumentExtraction.filterExtractorsByMIMEType(SingleDocumentExtraction.java:421)
>       at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:223)
>       at org.apache.any23.Any23.extract(Any23.java:298)
>       at org.apache.any23.Any23.extract(Any23.java:433)
> I am running this on Hadoop. When I run locally with a single file it works,
> but it does not work when run on Hadoop.
> Can someone please tell me how to go about this issue?


