Re: RE : Re: Issue with apache Tika
Great to hear!

From: radhia bezzine
Date: Thursday, February 22, 2018 at 12:28 PM
To: Chris Mattmann
Subject: Re: RE : Re: Issue with apache Tika

Hi Chris!

I fixed the issue! It was not so complicated: it was a version problem. The recent version doesn't work for me, but version 1.15 works fine.

Thank you very much. Good night!

On Thu, Feb 22, 2018 at 6:42 PM, bezzineradhia wrote:

Hello!

Thanks, I'll try it tomorrow! I'll let you know!

Regards!
Radhia

Sent from my Samsung Galaxy smartphone.

Original message
From: Chris Mattmann
Date: 22/02/2018 18:31 (GMT+01:00)
To: radhia bezzine
Cc: dev@tika.apache.org
Subject: Re: Issue with apache Tika

Try UTF-8 encoding the URLs or the parameters themselves. If you are using Tika-Python, then use the Python encode library…

Cheers,
Chris

From: radhia bezzine
Date: Thursday, February 22, 2018 at 6:03 AM
To: "Mattmann, Chris A (1761)"
Subject: Issue with apache Tika

Hello!

I hope you are doing well. I am writing to you because I have an issue running Apache Tika from Python. I'm trying to parse content and metadata from many URLs (existing on the internet); however, Tika sometimes returns an error like "invalid argument". I troubleshot the problem and realized that some URLs include forbidden characters, which is why Apache Tika reports "invalid argument".

I really don't know how to deal with this problem. I tried other tools, but I think Tika matches my needs.

Thank you very much for your time.

Best regards!
Radhia
Re: Issue with apache Tika
Try UTF-8 encoding the URLs or the parameters themselves. If you are using Tika-Python, then use the Python encode library…

Cheers,
Chris

From: radhia bezzine
Date: Thursday, February 22, 2018 at 6:03 AM
To: "Mattmann, Chris A (1761)"
Subject: Issue with apache Tika

Hello!

I hope you are doing well. I am writing to you because I have an issue running Apache Tika from Python. I'm trying to parse content and metadata from many URLs (existing on the internet); however, Tika sometimes returns an error like "invalid argument". I troubleshot the problem and realized that some URLs include forbidden characters, which is why Apache Tika reports "invalid argument".

I really don't know how to deal with this problem. I tried other tools, but I think Tika matches my needs.

Thank you very much for your time.

Best regards!
Radhia
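For readers who land on this message with the same error, here is a minimal sketch of what the suggestion to UTF-8 encode the URLs could look like in Python, using only the standard library's urllib.parse. The helper name encode_url and the example URL are hypothetical and not taken from the thread, and the thread does not confirm that encoding alone resolves the "invalid argument" error.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode (as UTF-8) characters that are not allowed in a URL.

    Hypothetical helper for illustration; not part of Tika or Tika-Python.
    """
    parts = urlsplit(url)
    # "%" is kept safe so that sequences which are already percent-encoded
    # are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    # The host part is left untouched; non-ASCII host names would need
    # separate (IDNA) handling, which this sketch does not attempt.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


if __name__ == "__main__":
    # Made-up URL containing a space and a non-ASCII character.
    print(encode_url("http://example.com/a b/é.pdf?q=x y"))
    # prints http://example.com/a%20b/%C3%A9.pdf?q=x%20y
```

The encoded string can then be handed to whatever Tika-Python call is already in use. Encoding the path, query, and fragment separately avoids escaping the "://" and "?" delimiters that give the URL its structure.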
[jira] [Commented] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character
[ https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372901#comment-16372901 ]

Hudson commented on TIKA-2580:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1439 (See [https://builds.apache.org/job/Tika-trunk/1439/])
Fix for TIKA-2580 contributed by ewanmellor. (commits: [https://github.com/apache/tika/commit/4fdc2b7247a033defed10874eecb986778ad451c])
* (edit) tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
TIKA-2580 (tallison: [https://github.com/apache/tika/commit/257ee161f9b6b9ea8a52dd52d71779479d9af254])
* (edit) tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java

> SafeContentHandler documentation is incorrect about replacement character
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2580
>                 URL: https://issues.apache.org/jira/browse/TIKA-2580
>             Project: Tika
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Minor
>             Fix For: 1.18, 2.0.0
>
>
> SafeContentHandler's doc comment states "All invalid characters are replaced
> with spaces." This has been untrue since TIKA-698 (Sep 2011).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Resolved] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character
[ https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2580.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0
                   1.18

Thank you!

> SafeContentHandler documentation is incorrect about replacement character
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2580
>                 URL: https://issues.apache.org/jira/browse/TIKA-2580
>             Project: Tika
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Minor
>             Fix For: 1.18, 2.0.0
>
>
> SafeContentHandler's doc comment states "All invalid characters are replaced
> with spaces." This has been untrue since TIKA-698 (Sep 2011).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character
[ https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372860#comment-16372860 ]

ASF GitHub Bot commented on TIKA-2580:
--------------------------------------

tballison closed pull request #220: Fix for TIKA-2580 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/220

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
index d3152c680..f82098493 100644
--- a/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
+++ b/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
@@ -31,7 +31,8 @@
  * ({@link #characters(char[], int, int)} or
  * {@link #ignorableWhitespace(char[], int, int)}) passed to the decorated
  * content handler contain only valid XML characters. All invalid characters
- * are replaced with spaces.
+ * are replaced with the Unicode replacement character U+FFFD (though a
+ * subclass may change this by overriding the writeReplacement method).
  *
  * The XML standard defines the following Unicode character ranges as
  * valid XML characters:

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> SafeContentHandler documentation is incorrect about replacement character
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2580
>                 URL: https://issues.apache.org/jira/browse/TIKA-2580
>             Project: Tika
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Minor
>
> SafeContentHandler's doc comment states "All invalid characters are replaced
> with spaces." This has been untrue since TIKA-698 (Sep 2011).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
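The corrected doc comment describes the behaviour in words only, so a small illustration may help. The following is a rough Python sketch of the idea, not Tika's Java implementation: every character outside the ranges that the XML 1.0 standard allows is replaced with U+FFFD. The function name and its replacement parameter are hypothetical; the parameter loosely mirrors the note that a subclass may change the replacement by overriding writeReplacement.

```python
import re

# Valid XML 1.0 characters:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated character class below matches everything else.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid_xml_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Replace each character that is not valid in XML 1.0.

    Hypothetical helper for illustration; the default replacement is the
    Unicode replacement character U+FFFD rather than a space.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # NUL and vertical tab are not valid XML characters; tab and newline are.
    cleaned = replace_invalid_xml_chars("a\x00b\x0bc\tok\n")
    print(repr(cleaned))
```

Passing replacement=" " to this sketch would give the behaviour that the old doc comment described.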