Re: RE : Re: Issue with apache Tika

2018-02-22 Thread Chris Mattmann
Great to hear!

 

 

From: radhia bezzine 
Date: Thursday, February 22, 2018 at 12:28 PM
To: Chris Mattmann 
Subject: Re: RE : Re: Issue with apache Tika

 

Hi Chris !  

 

I fixed the issue! It was not so complicated: it was a version problem. The 
most recent version doesn't work for me, but version 1.15 works fine.

 

Thank you very much.

 

Good Night !

 

On Thu, Feb 22, 2018 at 6:42 PM, bezzineradhia  wrote:

Hello !

 

Thanks, I'll try it tomorrow! I'll let you know!

 

Regards !

Radhia

 

 

 

Sent from my Samsung Galaxy smartphone.

 Original message 

From: Chris Mattmann

Date: 22/02/2018 18:31 (GMT+01:00)

To: radhia bezzine

Cc: dev@tika.apache.org

Subject: Re: Issue with apache Tika

 

Try UTF-8 encoding the URLs or the parameters themselves. If you are using 
Tika-Python, then use the Python
encode library…

 

Cheers,

Chris

 

 

 

From: radhia bezzine 
Date: Thursday, February 22, 2018 at 6:03 AM
To: "Mattmann, Chris A (1761)" 
Subject: Issue with apache Tika

 

Hello Dear ! 

 

I hope you are doing well.

 

I am writing to you because I have an issue running Apache Tika from Python.

I'm trying to parse content and metadata from many URLs (publicly available on the internet).

However, Tika sometimes returns an error like "invalid argument".

I troubleshot the problem and realized that some URLs include forbidden 
characters, which is why Apache Tika reports "invalid argument".

I really don't know how to deal with this problem. I tried other tools, but I 
think Tika matches my needs.

 

Thank you very much for your time.

 

Best regards! 

 

Radhia

 



Re: Issue with apache Tika

2018-02-22 Thread Chris Mattmann
Try UTF-8 encoding the URLs or the parameters themselves. If you are using 
Tika-Python, then use the Python
encode library…
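
As an illustration of the suggestion above, here is a minimal Python sketch of
percent-encoding a URL as UTF-8 before handing it to Tika. It uses only the
standard library's urllib.parse; the helper name encode_url and the example URL
are hypothetical and are not part of Tika-Python. It assumes the host name is
plain ASCII and that any "%" already in the URL starts a valid escape.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode unsafe characters in a URL (hypothetical helper).

    quote() encodes text as UTF-8 by default. "%" is kept in the safe set so
    that sequences which are already percent-encoded are not encoded twice.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    # The scheme and host are passed through unchanged (assumed to be ASCII).
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


if __name__ == "__main__":
    print(encode_url("http://example.com/a b/é.pdf?q=x y"))
    # http://example.com/a%20b/%C3%A9.pdf?q=x%20y
```

The encoded string could then be passed to whichever Tika-Python call is in
use; whether a given Tika-Python version accepts URLs directly is an assumption
to check against its documentation.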

 

Cheers,

Chris

 

 

 

From: radhia bezzine 
Date: Thursday, February 22, 2018 at 6:03 AM
To: "Mattmann, Chris A (1761)" 
Subject: Issue with apache Tika

 

Hello Dear ! 

 

I hope you are doing well.

 

I am writing to you because I have an issue running Apache Tika from Python.

I'm trying to parse content and metadata from many URLs (publicly available on the internet).

However, Tika sometimes returns an error like "invalid argument".

I troubleshot the problem and realized that some URLs include forbidden 
characters, which is why Apache Tika reports "invalid argument".

I really don't know how to deal with this problem. I tried other tools, but I 
think Tika matches my needs.

 

Thank you very much for your time.

 

Best regards! 

 

Radhia



[jira] [Commented] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character

2018-02-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372901#comment-16372901
 ] 

Hudson commented on TIKA-2580:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1439 (See 
[https://builds.apache.org/job/Tika-trunk/1439/])
Fix for TIKA-2580 contributed by ewanmellor. (commits: 
[https://github.com/apache/tika/commit/4fdc2b7247a033defed10874eecb986778ad451c])
* (edit) tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
TIKA-2580 (tallison: 
[https://github.com/apache/tika/commit/257ee161f9b6b9ea8a52dd52d71779479d9af254])
* (edit) tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java


> SafeContentHandler documentation is incorrect about replacement character
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2580
>                 URL: https://issues.apache.org/jira/browse/TIKA-2580
>             Project: Tika
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Minor
>             Fix For: 1.18, 2.0.0
>
>
> SafeContentHandler's doc comment states "All invalid characters are replaced 
> with spaces."  This has been untrue since TIKA-698 (Sep 2011).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character

2018-02-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2580.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0
                   1.18

Thank you!






[jira] [Commented] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372860#comment-16372860
 ] 

ASF GitHub Bot commented on TIKA-2580:
--------------------------------------

tballison closed pull request #220: Fix for TIKA-2580 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/220
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
index d3152c680..f82098493 100644
--- a/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
+++ b/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
@@ -31,7 +31,8 @@
  * ({@link #characters(char[], int, int)} or
  * {@link #ignorableWhitespace(char[], int, int)}) passed to the decorated
  * content handler contain only valid XML characters. All invalid characters
- * are replaced with spaces.
+ * are replaced with the Unicode replacement character U+FFFD (though a
+ * subclass may change this by overriding the writeReplacement method).
  * 
  * The XML standard defines the following Unicode character ranges as
  * valid XML characters:
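
To illustrate the behavior that the corrected comment describes, here is a
small Python sketch that replaces every character outside the XML 1.0 "Char"
ranges with U+FFFD. It is not Tika's Java implementation; the function name
replace_invalid_xml_chars and its replacement parameter are hypothetical, and
the parameter only loosely mirrors the idea that a subclass may override the
replacement.

```python
import re

# Everything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid_xml_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Return text with each invalid XML character replaced (illustrative only)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    cleaned = replace_invalid_xml_chars("a\x00b")
    # The NUL character is replaced by U+FFFD rather than by a space.
    print(cleaned == "a\ufffdb")  # True
```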


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




