Hi Jairo,

I created STANBOL-813 [1] and implemented a fix with revision [2].
Your test case now works for me, so it should work for you too.

Note that this fix does not address the general issue mentioned in
my first reply, so Stanbol might still write characters to the
enhancement structure that cause "application/rdf+xml"
serializations to fail.
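
To illustrate the general issue: the sketch below is NOT the STANBOL-813
fix and not code from Stanbol. It assumes the characters that break
"application/rdf+xml" output are those outside the XML 1.0 "Char"
production (mostly control characters and unpaired surrogates). The class
and method names are hypothetical. Accented letters such as 'í' are legal
and pass through unchanged.

```java
final class XmlCharFilter {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Hypothetical helper: replaces every code point that XML 1.0 forbids
     * with a single space. All forbidden code points occupy exactly one
     * Java char, so the length of the text does not change.
     */
    static String replaceIllegalXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(' ');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\b" (U+0008) is not allowed in XML 1.0; "\u00ed" ('í') is.
        System.out.println(replaceIllegalXmlChars("ciudades como Par\u00eds\b."));
    }
}
```

Because the text length is preserved, "enhancer:start" and "enhancer:end"
offsets would still point at the same positions. Like the existing method,
this still creates a copy of the text, so it is only meant to show which
characters are actually problematic.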

best
Rupert



[1] https://issues.apache.org/jira/browse/STANBOL-813
[2] http://svn.apache.org/viewvc?rev=1412756&view=rev

On Tue, Nov 20, 2012 at 6:19 AM, Rupert Westenthaler
<[email protected]> wrote:
> Hi Jairo,
>
> This is caused by the "removeNonUtf8CompliantCharacters(..)" in the
> NEREngineCore class (OpenNLP-NER engine) [1]. The JavaDoc says that
> this was added to avoid errors while creating "application/rdf+xml"
> responses.
>
> I only recently noticed this method while adapting the OpenNLP NER
> engine to work with the new Stanbol NLP processing chain
> (STANBOL-797). In the branch version of this engine [2],
> "removeNonUtf8CompliantCharacters(..)" is no longer called if the
> AnalyzedText ContentPart (STANBOL-734) is used as the source for the
> enhancements.
>
> Generally, I do not like this method, as it creates a copy of the parsed
> content, which can be a problem for big texts. In addition, as this is
> only done by this engine, there is still no guarantee that there are no
> non-UTF-8-compliant chars in the response (they might even come from
> literals in dereferenced Entities).
>
> In addition, this method seems to be overdoing it, because the 'í'
> in 'París' is clearly a valid UTF-8 character. Maybe Olivier
> Grisel can comment on that, because as far as I can remember he was
> the one who added this feature years ago.
>
> best
> Rupert
>
>
> [1] 
> http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/opennlp-ner/src/main/java/org/apache/stanbol/enhancer/engines/opennlp/impl/NEREngineCore.java
> [2] 
> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/engines/opennlp-ner/src/main/java/org/apache/stanbol/enhancer/engines/opennlp/impl/NEREngineCore.java
>
> On Mon, Nov 19, 2012 at 7:01 PM, Jairo Sarabia
> <[email protected]> wrote:
>> Hi Rupert,
>>
>> I tried to use the enhancer service for Spanish texts and I have problems
>> with character encoding.
>> In the service, characters with accents disappear from the JSON response,
>> and consequently important words of the language do not appear in the
>> responses.
>> I've tried using different encodings in the requests but none seem to
>> work:
>>
>> Examples of Headers:
>> 1)  -H "Accept: application/json", "Content-type: text/plain"
>> 2)  -H "Accept: application/json", "Content-type: text/plain; charset=utf-8"
>> 3)  -H "Accept: application/json", "Content-type: text/plain;
>> charset=iso-8859-1"
>> 4) -H "Accept: application/json", "Content-type: text/html; charset=utf-8",
>> "Accept-Language: es-es"
>> 5) -H "Accept: application/json", "Content-type: text/html;
>> charset=iso-8859-1", "Accept-Language: es-es"
>>
>> Example of curl request:
>>
>> REQUEST:
>>
>> curl -v -X POST -H "Accept: text/plain" -H "Content-type: text/html;
>> charset=utf-8" -H "Accept-language:es-es;en" --data "<html><body><p>The
>> Stanbol enhancer puede detectar personas famosas como Mariano Rajoy y
>> ciudades como París.</p></body></html>"
>> "http://ec2-50-16-118-169.compute-1.amazonaws.com:8080/enhancer/chain/notedlinks"
>>
>> JSON RESPONSE:
>>
>> {
>>  ....
>>
>>     {
>>       "@subject":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>       "dc:format": "text/html; charset=UTF-8",
>>       "http://www.w3.org/ns/ma-ont#hasFormat": "text/html; charset=UTF-8"
>>     },
>>     {
>>       "@subject": "urn:enhancement-0367734f-e48d-4dc3-e634-e5a3a4770706",
>>       "@type": [
>>         "enhancer:Enhancement",
>>         "enhancer:TextAnnotation"
>>       ],
>>       "dc:created": "2012-11-19T17:48:25.977Z",
>>       "dc:creator":
>> "org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine",
>>       "dc:type": "dbp-ont:Person",
>>       "enhancer:confidence": 0.98616,
>>       "enhancer:end": 71,
>>       "enhancer:extracted-from":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>       "enhancer:selected-text": {
>>         "@language": "es",
>>         "@literal": "Mariano Rajoy"
>>       },
>>       "enhancer:selection-context": {
>>         "@language": "es",
>>         "@literal": "The Stanbol enhancer puede detectar personas famosas
>> como Mariano Rajoy y ciudades como Par  s"
>>       },
>>       "enhancer:start": 58
>>     },
>>     {
>>       "@subject": "urn:enhancement-349dbd8a-6e8a-e6aa-4101-82cdc9b9f44e",
>>       "@type": [
>>         "enhancer:Enhancement",
>>         "enhancer:TextAnnotation"
>>       ],
>>       "dc:created": "2012-11-19T17:48:25.906Z",
>>       "dc:creator":
>> "org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine",
>>       "dc:language": "es",
>>       "dc:type": "dc:LinguisticSystem",
>>       "enhancer:confidence": 0.99999565,
>>       "enhancer:extracted-from":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c"
>>     },
>>     {
>>       "@subject": "urn:enhancement-4410ee09-dc9c-4d5f-0e2c-a269fe3658fc",
>>       "@type": [
>>         "enhancer:Enhancement",
>>         "enhancer:EntityAnnotation"
>>       ],
>>       "dc:created": "2012-11-19T17:48:25.985Z",
>>       "dc:creator":
>> "org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
>>       "dc:relation": "urn:enhancement-b650a7ef-c6ee-248f-57e0-754f60af9b55",
>>       "enhancer:confidence": 0.12,
>>       "enhancer:entity-label": {
>>         "@language": "es",
>>         "@literal": "Bollullos Par del Condado"
>>       },
>>       "enhancer:entity-reference":
>> "http://es.dbpedia.org/resource/Bollullos_Par_del_Condado",
>>       "enhancer:entity-type": [
>>         "dbp-ont:AdministrativeRegion",
>>         "schema:AdministrativeArea",
>>         "dbp-ont:PopulatedPlace",
>>         "schema:Place",
>>         "dbp-ont:Place",
>>         "owl:Thing"
>>       ],
>>       "enhancer:extracted-from":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>       "entityhub:site": "dbpedia"
>>     },
>>     {
>>       "@subject": "urn:enhancement-8cfad12b-9301-7922-37ba-f65a4ad6ab6f",
>>       "@type": [
>>         "enhancer:Enhancement",
>>         "enhancer:EntityAnnotation"
>>       ],
>>       "dc:created": "2012-11-19T17:48:25.985Z",
>>       "dc:creator":
>> "org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine",
>>       "dc:relation": "urn:enhancement-0367734f-e48d-4dc3-e634-e5a3a4770706",
>>       "enhancer:confidence": 1.0,
>>       "enhancer:entity-label": {
>>         "@language": "es",
>>         "@literal": "Mariano Rajoy"
>>       },
>>       "enhancer:entity-reference":
>> "http://es.dbpedia.org/resource/Mariano_Rajoy",
>>       "enhancer:entity-type": [
>>         "foaf:Person",
>>         "schema:Person",
>>         "dbp-ont:Person",
>>         "dbp-ont:Agent",
>>         "dbp-ont:President",
>>         "dbp-ont:Politician",
>>         "owl:Thing"
>>       ],
>>       "enhancer:extracted-from":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>       "entityhub:site": "dbpedia"
>>     },
>>     {
>>       "@subject": "urn:enhancement-b650a7ef-c6ee-248f-57e0-754f60af9b55",
>>       "@type": [
>>         "enhancer:Enhancement",
>>         "enhancer:TextAnnotation"
>>       ],
>>       "dc:created": "2012-11-19T17:48:25.979Z",
>>       "dc:creator":
>> "org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine",
>>       "dc:type": "dbp-ont:Place",
>>       "enhancer:confidence": 0.8029361,
>>       "enhancer:end": 91,
>>       "enhancer:extracted-from":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>       "enhancer:selected-text": {
>>         "@language": "es",
>>         "@literal": "Par"
>>       },
>>       "enhancer:selection-context": {
>>         "@language": "es",
>>         "@literal": "The Stanbol enhancer puede detectar personas famosas
>> como Mariano Rajoy y ciudades como Par  s"
>>       },
>>       "enhancer:start": 88
>>     },
>>     {
>>       "@subject": "urn:enhancement-d3e35917-ab84-05c1-2c2e-2a620f4976f4",
>>       "@type": [
>>         "enhancer:Enhancement",
>>         "enhancer:TextAnnotation"
>>       ],
>>       "dc:created": "2012-11-19T17:48:25.974Z",
>>       "dc:creator":
>> "org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine",
>>       "dc:language": "gl",
>>       "dc:type": "dc:LinguisticSystem",
>>       "enhancer:extracted-from":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c"
>>     }
>>   ]
>>
>> In the example above, you can see that the word "París" in the text has
>> the letter i with an accent. In the response, the letter "í" disappears
>> from the text, so the word "París" becomes "Par s", and the response does
>> not find the concept "Paris, capital of France"
>> ("http://es.dbpedia.org/resource/París").
>>
>> I would be grateful if you could tell me how to solve this problem with
>> texts in Spanish.
>>
>> Best,
>> Jairo
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
