Re: Post link to Tika in Action book on Tika website?

2010-08-02 Thread Oleg Tikhonov
+1, positively.


On Mon, Aug 2, 2010 at 8:33 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Tika community,

 Jukka Zitting and I are working on the Tika in Action book [1]. How would
 everyone feel about us posting a link to it on the Tika website [2]?

 If so, I'll prepare a patch and update the website shortly.

 Cheers,
 Chris

 [1] http://manning.com/mattmann/
 [2] http://tika.apache.org/


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++





-- 
Best regards, Oleg.


Re: [jira] Commented: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

2010-08-24 Thread Oleg Tikhonov
Hi Ken,
I used Nutch's LanguageProfiler to produce the language profile.
You can find more about this here:
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/authors.html
(This is not self-promotion!)
Download the sources; using the Ant task you'll be able to create a language profile.
If you need any help, please do not hesitate to ask.


BR,
Oleg.

2010/8/24 Jan Høydahl (JIRA) j...@apache.org


[
 https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901900#action_12901900]

 Jan Høydahl commented on TIKA-492:
 --

 I'm in the process of gathering enough text content for the profiles.

 I also posted a question to the user list to ask what tool/process you use
 to generate profiles but did not see an answer yet.

  Add language identification support for North Sami, Lule Sami and South
 Sami
 
 
 
  Key: TIKA-492
  URL: https://issues.apache.org/jira/browse/TIKA-492
  Project: Tika
   Issue Type: New Feature
   Components: languageidentifier
 Affects Versions: 0.7
 Reporter: Jan Høydahl
 Assignee: Ken Krugler
 Priority: Minor
 
  We need added support for Sami languages.
  According to document Requirements for support for Sami languages in
 data processing (http://www.samit.no/01-850-51.pdf) Tika will get Basic
 Level support by detecting North Sami, Lule Sami and South Sami.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




-- 
Best regards, Oleg.


Re: Error thrown with TikaConfig() constructor

2010-09-12 Thread Oleg Tikhonov
Here are the situations I can think of where you would want to implement a
custom classloader:
1. You need a different hierarchy for loading classes, as in OSGi for
instance. The Hollywood principle, if you like.
2. You need to run different versions of classes or jars. For example,
you want to load class A with version 1.1.2, while class B needs version
2.3.4.
3. You need to edit the bytecode of a class at runtime and reload it.
4. Most obviously, you need to load classes from the network; the default
classloader loads classes that are available locally.
5. You need to create classes dynamically and load them on the fly.
6. You need to run multiple Java applications inside a single JVM.

BR,
Oleg.


On Sun, Sep 12, 2010 at 4:46 PM, Ken Krugler kkrugler_li...@transpac.com wrote:


 On Sep 11, 2010, at 1:17pm, Ken Krugler wrote:

  On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch nick.bu...@alfresco.com
 wrote:

 Quite a lot of OfficeParser does depend on poifs code though, as well as a
 few bits that depend on some of the less common POI text extractors.


 It looks like a number of our other new parsers also have direct
 dependencies to external libraries, so this problem is not just
 related to the OfficeParser class.

 The basic problem here is that the service loader used by the default
 TikaConfig constructor throws an exception when it can't load a class
 listed in a org.apache.tika.parser.Parser service file. The solution I
 implemented in TIKA-378 for the 0.7 release was to move the external
 parser library references to separate extractor classes so that the
 parser class could be instantiated without problems. Unfortunately
 this was a one-off solution that obviously hasn't survived further
 development in the svn trunk.

 The reason why I originally didn't want to simply catch and ignore the
 potential exceptions in the TikaConfig constructor was the lack of a
 good error reporting mechanism. The trick of insulating the external
 library dependencies to separate extractor classes nicely solved that
 problem by delaying the exceptions to the actual parse() method calls
 on specific document types, which obviously would then give the end
 user a much better idea of what's wrong.

 Perhaps the best solution would actually be to combine the above
 approaches, i.e. to strive to maintain the parser/extractor separation
 where possible and to use a catch block in the TikaConfig constructor
 to catch and ignore any problems that the insulation approach fails to
 address.


 IIRC, the main concern about this approach is when people are using custom
 parsers, where instantiation exceptions can happen due to bugs in the actual
 parser (versus explicitly excluded jars). Quietly ignoring these errors
 leads to late failing, which can be a bad thing.

 What I would propose is two changes:

 1. Add a new TikaConfig(ClassLoader, Class<Parser>...) constructor that
 can be used to instantiate all parsers from the variable list that are
 found using the ClassLoader. For example:

   public TikaConfig(ClassLoader loader, Class<Parser>... targetParsers)
           throws MimeTypeException, IOException {
       for (Class<Parser> parserClass : targetParsers) {
           ParseContext context = new ParseContext();

           try {
               // Instantiate each requested parser and register it for
               // every media type it reports as supported.
               Parser parser = parserClass.newInstance();
               for (MediaType type : parser.getSupportedTypes(context)) {
                   parsers.put(type, parser);
               }
           } catch (InstantiationException e) {
               throw new IOException(e);
           } catch (IllegalAccessException e) {
               throw new IOException(e);
           }
       }

       mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
   }


 So after looking again at the code snippet I threw together above, it's not
 using the provided Classloader. I could iterate over parsers and
 catch/ignore errors to parsers not in the provided list, but that seems less
 than clean.

 I don't have much experience with classloaders - I see that each instance
 of a Class has a classloader associated with it, so mapping from its
 classloader to the provided classloader would need something like:

    Class<Parser> resolvedClass =
        (Class<Parser>) loader.loadClass(parserClass.getCanonicalName());
    Parser parser = resolvedClass.newInstance();

 But that also seems clunky. Any other suggestions?
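
A minimal sketch of the lookup discussed above, in plain Java and independent
of Tika: it resolves a class by name against a caller-supplied loader and
replaces the unchecked cast with asSubclass. The class name LoaderResolution
and the method instantiate are hypothetical, and java.util.List stands in for
Tika's Parser interface so that the example runs without Tika on the
classpath.

```java
import java.util.List;

final class LoaderResolution {
    private LoaderResolution() {}

    /**
     * Loads className through the given loader, checks that it is a
     * subtype of expectedType, and creates an instance via its public
     * no-argument constructor.
     */
    static <T> T instantiate(ClassLoader loader, String className, Class<T> expectedType)
            throws ReflectiveOperationException {
        // Class.forName with an explicit loader resolves the name in that
        // loader's namespace rather than the caller's.
        Class<?> raw = Class.forName(className, true, loader);
        // asSubclass replaces the unchecked (Class<T>) cast and throws
        // ClassCastException when the loaded type does not match.
        Class<? extends T> typed = raw.asSubclass(expectedType);
        return typed.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        List<?> list = instantiate(loader, "java.util.ArrayList", List.class);
        System.out.println(list.getClass().getName());
    }
}
```

Class.getName() is a safer input for this kind of lookup than
getCanonicalName(): for a nested class the canonical name uses a dot where
the binary name uses '$', and class loaders resolve binary names.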

 As an aside, what's the standard use case for specifying an explicit
 classloader? I haven't seen this used in other projects, so I'm curious.

 Thanks,


 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n g








-- 
Best regards, Oleg.


Re: [jira] Commented: (TIKA-593) Tika network server

2011-02-08 Thread Oleg Tikhonov
Why not use:
http://felix.apache.org/site/apache-felix-http-service.html



On Tue, Feb 8, 2011 at 5:06 PM, Chris A. Mattmann (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991997#comment-12991997]

 Chris A. Mattmann commented on TIKA-593:
 

 My 2c on this:

 +1 to using JAX RS

 RE: the actual implementation, I used Apache CXF for OODT and it's
 basically a jar drop in (or MVN pom.xml update) single dependency. Wink is
 still Incubating right?

  Tika network server
  ---
 
  Key: TIKA-593
  URL: https://issues.apache.org/jira/browse/TIKA-593
  Project: Tika
   Issue Type: New Feature
   Components: general
 Reporter: Jukka Zitting
 Assignee: Jukka Zitting
 
  It would be cool to be able to run Tika as a network service that accepts
 a binary document as input and produces the extracted content (as XHTML,
 text, or just metadata) as output. A bit like TIKA-169, but without the
 dependency to a servlet container.
  I'd like to be able to set up and run such a server like this:
  $ java -jar tika-app.jar --port 1234
  We should also add a NetworkParser class that acts as a local client for
 such a service. This way a lightweight client could use the full set of Tika
 parsing functionality even with just the tika-core jar within its classpath.

 --
 This message is automatically generated by JIRA.
 -
 For more information on JIRA, see: http://www.atlassian.com/software/jira





-- 
Best regards, Oleg.


Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2011-03-31 Thread Oleg Tikhonov
Hello Tran Nam Quang,

It uses the CHMLIB C library, i.e. JNI. In my experience, it works on only
a limited number of OSes; it does not work on Solaris or AIX.
A really good library, with the same limitations mentioned above, is
http://sevenzipjbind.sourceforge.net/, which is also LGPL (I would say it is
the best one).

BR,
Oleg

On Thu, Mar 31, 2011 at 8:12 PM, Tran Nam Quang (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014110#comment-13014110]

 Tran Nam Quang commented on TIKA-245:
 -

 Hello guys,

 Here's another CHM library for Java, licensed under the LGPL:
 http://sourceforge.net/projects/chm4j/

 Best regards
 Tran Nam Quang

  Support of CHM Format
  -
 
  Key: TIKA-245
  URL: https://issues.apache.org/jira/browse/TIKA-245
  Project: Tika
   Issue Type: New Feature
   Components: parser
  Environment: All
 Reporter: Karl Heinz Marbaise
 Priority: Minor
  Attachments: TIKA-245.tikhonov.20103107.patch.txt,
 TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
 
 
  It might be a good idea to support the CHM File format of Windows. Some
 information about
 http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
 The CHM format contains HTML files which can be parsed by Tika. So the
 only problem is to extract the data from the CHM file.





-- 
Best regards, Oleg.


Re: [jira] [Commented] (TIKA-546) Add ability to create language profiles to tika-app

2011-04-14 Thread Oleg Tikhonov
Sami,
Chris and I did that some time ago for the developerWorks tutorial; clean
code exists, although it may be out of date.
I wonder: is it a good idea to use Nutch code inside Tika? Maybe the Nutch
guys could extend it into an independent module?



On Thu, Apr 14, 2011 at 3:01 PM, Sami Siren (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019793#comment-13019793]

 Sami Siren commented on TIKA-546:
 -

 bq. Do we build the LanguageProfilerBuilder from Nutch code here locally
 and ship it as binary package/library or as part of mvn install task/ ant
 task?

 I would just do what Jan suggested = get the relevant source files from
 Nutch, modify them as needed (like remove dependencies etc) and commit this
 into Tika svn repository.



  Add ability to create language profiles to tika-app
  ---
 
  Key: TIKA-546
  URL: https://issues.apache.org/jira/browse/TIKA-546
  Project: Tika
   Issue Type: New Feature
   Components: cli, languageidentifier
 Affects Versions: 0.7
 Reporter: Jan Høydahl
 
  Since TIKA-490 it is supposed to be easy adding new language profiles to
 TIKA. However, currently the process involves using Nutch's NGramProfile
 tool and editing the output.
  We should port Nutch's profile builder to Tika and make it part of
 tika-app.jar:
  # See http://wiki.apache.org/nutch/LanguageIdentifier
  # java -jar tika-app.jar --create-profile [--gramsizes=n,n,...]
 [--maxlines=max] profile-name filename encoding
  Using --gramsizes and --maxlines, we could support both Tika-style
 profiles and Nutch-style profiles and thus deprecate the Nutch tool.
 Defaults should be --gramsizes=3 --maxlines=1000




Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2011-06-07 Thread Oleg Tikhonov
Hi Chris,

I've applied the patch to
tika-parsers/src/main/java/org/apache/tika/parser/chm, and also added 3 CHM
files to tika-parsers/src/test/resources/test-documents, along with the tests.

BR,
Oleg

On Sun, Jun 5, 2011 at 1:32 AM, Chris A. Mattmann (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044403#comment-13044403]

 Chris A. Mattmann commented on TIKA-245:
 

 Hi Oleg,

 Looking over this patch, I have a few recommendations:

 # the patch should be applied to the Tika source tree format (e.g.,
 tika-parsers/src/main/java/org/apache/tika/parsers/chm)
 # Many of the class-top-level comments can probably be removed and thrown
 up on the Tika Wiki
 # it would be nice to include at least a unit test or 2 to know this is
 working. It's a huge patch, and I don't have a lot of CHM files to test it
 out on (being a Mac guy :-) )

 Cheers,
 Chris



  Support of CHM Format
  -
 
  Key: TIKA-245
  URL: https://issues.apache.org/jira/browse/TIKA-245
  Project: Tika
   Issue Type: New Feature
   Components: parser
  Environment: All
 Reporter: Karl Heinz Marbaise
 Priority: Minor
  Attachments: TIKA-245.tikhonov.04082011.patch.txt,
 TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt,
 TIKA-245.tikhonov.20112703.txt
 
 
  It might be a good idea to support the CHM File format of Windows. Some
 information about
 http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
 The CHM format contains HTML files which can be parsed by Tika. So the
 only problem is to extract the data from the CHM file.




Re: [jira] [Assigned] (TIKA-245) Support of CHM Format

2011-06-07 Thread Oleg Tikhonov
Thank you Chris and Jukka!

I tried to keep the KISS principle, but couldn't.


On Tue, Jun 7, 2011 at 6:49 PM, Chris A. Mattmann (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Chris A. Mattmann reassigned TIKA-245:
 --

Assignee: Chris A. Mattmann

  Support of CHM Format
  -
 
  Key: TIKA-245
  URL: https://issues.apache.org/jira/browse/TIKA-245
  Project: Tika
   Issue Type: New Feature
   Components: parser
  Environment: All
 Reporter: Karl Heinz Marbaise
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 1.0
 
  Attachments: TIKA-245.tikhonov.04082011.patch.txt,
 TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt,
 TIKA-245.tikhonov.20112703.txt
 
 
  It might be a good idea to support the CHM File format of Windows. Some
 information about
 http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
 The CHM format contains HTML files which can be parsed by Tika. So the
 only problem is to extract the data from the CHM file.





-- 
Best regards, Oleg.


support of Java 5

2011-06-08 Thread Oleg Tikhonov
Hello all,

As you may know, according to Oracle, Java 5 SE has been EOL (End Of Life)
since 2009. However, we still support Java 5 SE. What is the rationale
behind this? Why do we encourage our customers not to upgrade to more
modern versions of Java?
When developing new products and features, we cannot benefit from the more
sophisticated solutions that modern Java gives us.

Kind regards,

Oleg


Re: Build failed in Jenkins: Tika-trunk #563

2011-06-08 Thread Oleg Tikhonov
Chris, Nick,

I've attached the patch; I hope it will compile and work now.

BR,
Oleg

On Wed, Jun 8, 2011 at 6:09 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Nick,

 Yep, didn't even catch it on commit. Oleg emailed me offlist (which I asked
 him to bring onlist) and caught it too. He's offered to fix the issue. If he
 doesn't get to it, I will take a look tonight or tomorrow...

 Cheers,
 Chris

 On Jun 8, 2011, at 8:05 AM, Nick Burch wrote:

  [INFO] -
  [ERROR] COMPILATION ERROR :
  [INFO] -
  [ERROR]
 https://builds.apache.org/job/Tika-trunk/ws/trunk/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmCommons.java:[110,32]
 cannot find symbol
  symbol  : method copyOfRange(byte[],int,int)
  location: class java.util.Arrays
  [ERROR]
 https://builds.apache.org/job/Tika-trunk/ws/trunk/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmCommons.java:[211,77]
 cannot find symbol
  symbol  : method isEmpty()
  location: class java.lang.String
  [ERROR]
 https://builds.apache.org/job/Tika-trunk/ws/trunk/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxBlock.java:[836,54]
 cannot find symbol
  symbol  : method copyOfRange(byte[],int,int)
  location: class java.util.Arrays
 
  Looks like there are maybe some 1.6isms in the CHM code?
 
  Nick
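
The three errors above come from two methods that first appeared in Java 6:
Arrays.copyOfRange and String.isEmpty. A minimal sketch of Java 5-compatible
stand-ins follows. The class name Java5Compat is hypothetical, and this is
not necessarily how the CHM code was actually patched.

```java
import java.util.Arrays;

final class Java5Compat {
    private Java5Compat() {}

    /**
     * Stand-in for Arrays.copyOfRange(byte[], int, int): copies the range
     * [from, to) and pads with zeros when 'to' runs past the end.
     */
    static byte[] copyOfRange(byte[] original, int from, int to) {
        if (from > to) {
            throw new IllegalArgumentException(from + " > " + to);
        }
        if (from < 0 || from > original.length) {
            throw new ArrayIndexOutOfBoundsException(from);
        }
        int newLength = to - from;
        byte[] copy = new byte[newLength];
        // Copy only the bytes that exist; the rest of 'copy' stays zero.
        System.arraycopy(original, from, copy, 0,
                Math.min(original.length - from, newLength));
        return copy;
    }

    /** Stand-in for String.isEmpty(). */
    static boolean isEmpty(String s) {
        return s.length() == 0;
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40};
        System.out.println(Arrays.toString(copyOfRange(data, 1, 3)));
        System.out.println(isEmpty(""));
    }
}
```

System.arraycopy and String.length() have been available since the earliest
Java releases, so helpers like these compile at source level 1.5.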


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




Re: Build failed in Jenkins: Tika-trunk #563

2011-06-08 Thread Oleg Tikhonov
Hi Jukka,
no problem at all.
I'll reformat and commit tomorrow then.

BR,
Oleg

On Wed, Jun 8, 2011 at 9:56 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:

 Hi Oleg,

 On Wed, Jun 8, 2011 at 8:20 PM, Oleg Tikhonov o...@apache.org wrote:
  I've attached the patch, hope now it will work/compile.

 Feel free to commit it directly.

 BTW, in Tika we've generally tried to use only spaces for indentation.
 If it's not too much trouble, it would be nice if you could use the
 same settings for Tika code.

 BR,

 Jukka Zitting




-- 
Best regards, Oleg.


Re: Build failed in Jenkins: Tika-trunk #564

2011-06-09 Thread Oleg Tikhonov
Jukka,

Committed revision 1133955.

BR,
Oleg

On Thu, Jun 9, 2011 at 11:52 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

 Hi,

 On Thu, Jun 9, 2011 at 4:20 AM, Apache Jenkins Server
 jenk...@builds.apache.org wrote:
  cause : Too many unapproved licenses: 4

 The following files in tika-parsers are missing Apache license headers:

src/main/java/org/apache/tika/parser/chm/core/ChmWrapper.java
src/test/java/org/apache/tika/parser/chm/TestPmglHeader.java
src/test/java/org/apache/tika/parser/chm/TestPmgiHeader.java
src/test/java/org/apache/tika/parser/chm/TestChmDocumentInformation.java

 Oleg, can you add the missing headers? For future records it's best if
 the author of the files adds the headers.

 BR,

 Jukka Zitting



Re: Build failed in Jenkins: Tika-trunk #580

2011-07-18 Thread Oleg Tikhonov
Good evening,
Which files fail the RAT scan?

Thanks in advance,
Oleg

On Mon, Jul 18, 2011 at 10:55 PM, Apache Jenkins Server 
jenk...@builds.apache.org wrote:

 See https://builds.apache.org/job/Tika-trunk/580/changes

 Changes:

 [kkrugler] Add quick test to validate that RSS feeds will be processed by
 the appropriate parser (see
 https://issues.apache.org/jira/browse/NUTCH-1053).

 --
 [...truncated 336 lines...]
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.092 sec
 Running org.apache.tika.parser.dwg.DWGParserTest
 Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.015 sec
 Running org.apache.tika.parser.asm.ClassParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.067 sec
 Running org.apache.tika.parser.audio.MidiParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.035 sec
 Running org.apache.tika.parser.audio.AudioParserTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
 Running org.apache.tika.parser.mp3.Mp3ParserTest
 Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.219 sec
 Running org.apache.tika.parser.xml.FictionBookParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.267 sec
 Running org.apache.tika.parser.xml.DcXMLParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running org.apache.tika.parser.pdf.PDFParserTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.884 sec
 Running org.apache.tika.parser.image.MetadataFieldsTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
 Running org.apache.tika.parser.image.ImageMetadataExtractorTest
 Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.075 sec
 Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
 Running org.apache.tika.parser.image.TiffParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
 Running org.apache.tika.parser.image.ImageParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
 Running org.apache.tika.parser.odf.ODFParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec
 Running org.apache.tika.parser.microsoft.POIContainerExtractionTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.826 sec
 Running org.apache.tika.parser.microsoft.PublisherParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
 Running org.apache.tika.parser.microsoft.WriteProtectedParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.778 sec
 Running org.apache.tika.parser.microsoft.TNEFParserTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.158 sec
 Running org.apache.tika.parser.microsoft.VisioParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
 Running org.apache.tika.parser.microsoft.OutlookParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.234 sec
 Running org.apache.tika.parser.microsoft.ooxml.OOXMLContainerExtractionTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.771 sec
 Running org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest
 Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.282 sec
 Running org.apache.tika.parser.microsoft.PowerPointParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
 Running org.apache.tika.parser.microsoft.WordParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.027 sec
 Running org.apache.tika.parser.microsoft.ExcelParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.849 sec

 Results :

 Tests run: 305, Failures: 0, Errors: 0, Skipped: 0

 [TASKS] Skipping maven reporter: there is already a result available.
 [JENKINS] Recording test results
 [INFO]
 [INFO] --- maven-bundle-plugin:2.1.0:bundle (default-bundle) @ tika-parsers
 ---
 [WARNING] Warning building bundle
 org.apache.tika:tika-parsers:bundle:1.0-SNAPSHOT : Split package
 org/apache/tika/detect
 Use directive -split-package:=(merge-first|merge-last|error|first) on
 Export/Private Package instruction to get rid of this warning
 Package found in   [Jar:., Jar:tika-core]
 Reference from 
 https://builds.apache.org/job/Tika-trunk/ws/trunk/tika-core/target/tika-core-1.0-SNAPSHOT.jar
 
 Classpath  [Jar:., Jar:tika-core, Jar:netcdf, Jar:slf4j-api,
 Jar:apache-mime4j, Jar:commons-logging, Jar:commons-compress,
 Jar:commons-codec, Jar:pdfbox, Jar:fontbox, Jar:jempbox, Jar:bcmail-jdk15,
 Jar:bcprov-jdk15, Jar:poi, Jar:poi-scratchpad, Jar:poi-ooxml,
 Jar:poi-ooxml-schemas, Jar:xmlbeans, Jar:dom4j,
 Jar:geronimo-stax-api_1.0_spec, Jar:tagsoup, Jar:asm,
 Jar:metadata-extractor, Jar:boilerpipe, Jar:rome, 

Re: [WARNING] Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7

2011-07-28 Thread Oleg Tikhonov
FYI,

On Fri, Jul 29, 2011 at 12:13 AM, Uwe Schindler uschind...@apache.org wrote:

 Hello Apache Lucene & Apache Solr users,
 Hello users of other Java-based Apache projects,

 Oracle released Java 7 today. Unfortunately it contains hotspot compiler
 optimizations, which miscompile some loops. This can affect code of several
 Apache projects. Sometimes JVMs only crash, but in several cases, results
 calculated can be incorrect, leading to bugs in applications (see Hotspot
 bugs 7070134 [1], 7044738 [2], 7068051 [3]).

 Apache Lucene Core and Apache Solr are two Apache projects, which are
 affected by these bugs, namely all versions released until today. Solr users
 with the default configuration will have Java crashing with SIGSEGV as soon
 as they start to index documents, as one affected part is the well-known
 Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be
 miscompiled, too, leading to index corruption (especially on Lucene trunk
 with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

 These problems were detected only 5 days before the official Java 7 release,
 so Oracle had no time to fix those bugs, affecting also many more
 applications. In response to our questions, they proposed to include the
 fixes into service release u2 (eventually into service release u1, see [6]).
 This means you cannot use Apache Lucene/Solr with Java 7 releases before
 Update 2! If you do, please don't open bug reports, it is not the
 committers' fault! At least disable loop optimizations using the
 -XX:-UseLoopPredicate JVM option to not risk index corruptions.

 Please note: Also Java 6 users are affected, if they use one of those JVM
 options, which are not enabled by default: -XX:+OptimizeStringConcat or
 -XX:+AggressiveOpts

 It is strongly recommended not to use any hotspot optimization switches in
 any Java version without extensive testing!

 In case you upgrade to Java 7, remember that you may have to reindex, as the
 unicode version shipped with Java 7 changed and tokenization behaves
 differently (e.g. lowercasing). For more information, read
 JRE_VERSION_MIGRATION.txt in your distribution package!

 On behalf of the Lucene project,
 Uwe

 [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134
 [2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738
 [3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068051
 [4] https://issues.apache.org/jira/browse/LUCENE-3335
 [5] https://issues.apache.org/jira/browse/LUCENE-3346
 [6] http://s.apache.org/StQ

 -
 Uwe Schindler
 uschind...@apache.org
 Apache Lucene PMC Member / Committer
 Bremen, Germany
 http://lucene.apache.org/





Re: http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

2011-08-22 Thread Oleg Tikhonov
Hey,

and welcome to Tika.

If you use Eclipse, you had better install the m2eclipse plug-in:
http://m2eclipse.sonatype.org/sites/m2e

Once the plug-in is downloaded and installed, your next step could be
importing the Tika project like this: 'File -> Import -> Existing Maven Project'
...

However, if you only want Tika compiled and packaged, run the following
command:

mvn clean install [-Dmaven.test.skip=true or -DskipTests=true]

BR,

Oleg



On Mon, Aug 22, 2011 at 7:33 PM, prince shah princeelect...@gmail.com wrote:

 Hi Geeks,

 I am new to Open source community and wanted to start with Tika project.
 I checkout latest version of Tika (1160218). Then went to my tika-site and
 hit mvn install (I have mac) it download bunch of stuff and in the end it
 spit out following exceptions. Can any one help me.

 Is there any other way to checkout the source code and step up eclipse for
 debugging ?
 Here is the stack trace :

 Results :

 Failed tests:
  testHttpServerFileExtensions(org.apache.tika.TikaTest):
 expected:...type1 but was:...printer-metric

 Tests run: 100, Failures: 1, Errors: 0, Skipped: 0

 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Apache Tika parent  SUCCESS [4.350s]
 [INFO] Apache Tika core .. FAILURE
 [38.531s]
 [INFO] Apache Tika parsers ... SKIPPED
 [INFO] Apache Tika application ... SKIPPED
 [INFO] Apache Tika OSGi bundle ... SKIPPED
 [INFO] Apache Tika ... SKIPPED
 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 45.636s
 [INFO] Finished at: Mon Aug 22 00:51:29 PDT 2011
 [INFO] Final Memory: 15M/81M
 [INFO]
 
 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-surefire-plugin:2.7.2:test (default-test) on
 project tika-core: There are test failures.
 [ERROR]
 [ERROR] Please refer to
 /Users/princesh/Documents/workspace/Tika/tika-core/target/surefire-reports
 for the individual test results.
 [ERROR] - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
 switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please
 read the following articles:
 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with the
 command
 [ERROR]   mvn <goals> -rf :tika-core


 Thank you



 --
 Prince Shah
 424-832-6296
 959 Rich Ave, Apt 28
 Mountain View, CA 94040

 Software Developer
 http://www.shahprince.info



Re: [jira] [Created] (TIKA-699) Automatic checks against backwards-incompatible API changes

2011-08-26 Thread Oleg Tikhonov
I'm in favor, +1.

On Fri, Aug 26, 2011 at 1:22 PM, Jukka Zitting (JIRA) j...@apache.org wrote:

 Automatic checks against backwards-incompatible API changes
 ---

 Key: TIKA-699
 URL: https://issues.apache.org/jira/browse/TIKA-699
 Project: Tika
  Issue Type: Improvement
Reporter: Jukka Zitting


 As we get closer to 1.x we should add tooling like the Maven Clirr plugin
 [1] to guard against accidental backwards-incompatible API changes.

 [1] http://mojo.codehaus.org/clirr-maven-plugin/






Re: Welcome Mike McCandless to the Tika PMC and as a Tika Committer

2011-08-29 Thread Oleg Tikhonov
Hi Mike! Congrats!

I worked with the OmniFind edition at IBM Jerusalem :-) ... I heard about you
from my colleagues (Josemina, Yariv), and now I've met you here! Welcome!




On Mon, Aug 29, 2011 at 6:14 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Thanks Chris!

 Here's a quick intro:

 I now work at IBM, who has (generously: thank you!) sponsored my
 contributions to Lucene/Solr for a long time now (like 5 years, wow!).

 Before that I was co-founder of a startup called iPhrase Technologies,
 selling enterprise search software; we didn't use Lucene but wanted
 to.  And we would have used Tika in a heartbeat if it had been around
 back then!

 Before that, long ago, in a galaxy far away, I got a PhD at MIT
 (Computer science).

 I've been impressed by Tika for quite some time, watching it from a
 distance, using it here and there.  I love how simple its API is, and
 that basic usage can simply invoke the command-line tool.  And
 it's solving an incredibly important problem -- unlocking the text
 inside the zillions of document formats we all use now.

 Writing the Tika chapter in Lucene in Action 2nd Edition (replacing
 the previous custom framework that was used for the first edition) was
 great fun.

 I'm happy to be on board and I'm looking forward to improving Tika.
 It's all still very new to me so I will tread lightly...

 Mike McCandless

 http://blog.mikemccandless.com

 On Mon, Aug 29, 2011 at 10:48 AM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
  Hi Folks,
 
  The Tika PMC just elected Mike McCandless as a Tika PMC member and
 committer. Mike's
  made a number of valuable contributions to Tika over the years and is a
 longtime contributor to the
  Apache Lucene project.
 
  Mike, feel free to say a bit about yourself and welcome aboard!
 
  Cheers,
  Chris
  (on behalf of the Tika PMC)
 
  ++
  Chris Mattmann, Ph.D.
  Senior Computer Scientist
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 171-266B, Mailstop: 171-246
  Email: chris.a.mattm...@nasa.gov
  WWW:   http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Assistant Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 



Re: [jira] [Commented] (TIKA-546) Add ability to create language profiles to tika-app

2011-09-17 Thread Oleg Tikhonov
Yes, it's resolved; the status just needs to be changed.


2011/9/18 Jan Høydahl (JIRA) j...@apache.org


[
 https://issues.apache.org/jira/browse/TIKA-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107290#comment-13107290]

 Jan Høydahl commented on TIKA-546:
 --

 What's the state of this issue? It says unresolved but something is
 committed?

  Add ability to create language profiles to tika-app
  ---
 
  Key: TIKA-546
  URL: https://issues.apache.org/jira/browse/TIKA-546
  Project: Tika
   Issue Type: New Feature
   Components: cli, languageidentifier
 Affects Versions: 0.7
 Reporter: Jan Høydahl
 Assignee: Chris A. Mattmann
  Attachments: TIKA-546.tikhonov.18042011.PATCH
 
 
  Since TIKA-490 it is supposed to be easy adding new language profiles to
 TIKA. However, currently the process involves using Nutch's NGramProfile
 tool and editing the output.
  We should port Nutch's profile builder to Tika and make it part of
 tika-app.jar:
  # See http://wiki.apache.org/nutch/LanguageIdentifier
  # java -jar tika-app.jar --create-profile [--gramsizes=n,n,...]
 [--maxlines=max] profile-name filename encoding
  Using --gramsizes and --maxlines, we could support both Tika-style
 profiles and Nutch-style profiles and thus deprecate the Nutch tool.
 Defaults should be --gramsizes=3 --maxlines=1000

 --
 This message is automatically generated by JIRA.
 For more information on JIRA, see: http://www.atlassian.com/software/jira
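
The profile-building idea described in the issue above can be sketched roughly as follows. This is only a minimal illustration of counting character n-grams (trigrams, matching the proposed default of --gramsizes=3). The class and method names are hypothetical, and this is not the Nutch or Tika implementation; the normalization step (lower-casing, whitespace runs replaced by '_') is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: counts character n-grams in a text sample. */
class NGramProfileSketch {

    /** Returns a map of n-gram to occurrence count for the given text. */
    static Map<String, Integer> countNGrams(String text, int gramSize) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Assumed normalization: lower-case, whitespace runs become '_'
        String normalized = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", "_");
        for (int i = 0; i + gramSize <= normalized.length(); i++) {
            String gram = normalized.substring(i, i + gramSize);
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the trigram counts for a tiny sample
        System.out.println(countNGrams("the then", 3));
    }
}
```

A real profile would be built from a much larger corpus and would typically keep only the most frequent n-grams.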





Re: [VOTE] Apache Tika 0.10 release rc #1

2011-09-26 Thread Oleg Tikhonov
In favor of releasing Tika 0.10: +1



On Mon, Sep 26, 2011 at 9:50 AM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,

 A first release candidate for the Tika 0.10 release is available at:

http://people.apache.org/~mattmann/apache-tika-0.10/rc1/

 The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/0.10/

 The SHA1 checksum of the archive is
 355d0b2fa0de232672e4760941ea0dcf641a82ad.

 A staged Maven repository is available at:

 https://repository.apache.org/content/repositories/orgapachetika-100/

 Please vote on releasing this package as Apache Tika 0.10.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 0.10
[ ] -1 Do not release this package because...

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
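
For anyone reviewing a release candidate like the one above, the SHA1 comparison can be done with a few lines of plain Java. This is a minimal sketch using only the JDK; the class name and the command-line arguments are hypothetical, and the expected checksum is whatever the vote mail states.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: computes the SHA-1 checksum of a stream as lower-case hex. */
class Sha1Check {

    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        // Feed the stream to the digest in chunks
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive, args[1]: checksum from the vote mail
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
        }
    }
}
```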




Re: [jira] [Commented] (TIKA-513) Support of Deja Vu (DjVu) format

2011-10-07 Thread Oleg Tikhonov
There is one (GPL-licensed) that I've been playing with:

http://javadjvu.foxtrottechnologies.com/

However, in order to extract text/content from images, we have to find a
suitable OCR implementation.





On Fri, Oct 7, 2011 at 11:02 AM, Jukka Zitting (Commented) (JIRA) 
j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122639#comment-13122639]

 Jukka Zitting commented on TIKA-513:
 

 Is there a DjVu parser we could use?

  Support of Deja Vu (DjVu) format
  
 
  Key: TIKA-513
  URL: https://issues.apache.org/jira/browse/TIKA-513
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Oleg Tikhonov
 
  It might be great if Tika could provide such a parser. Any
 suggestions/thoughts?

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators:
 https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
 For more information on JIRA, see: http://www.atlassian.com/software/jira





Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2011-10-22 Thread Oleg Tikhonov
Hi Tran Nam Quang,
Currently our CHM extractor skips all entities that are not HTML.
It would be great if you could write a list of desired entities to be
extracted. In addition, if you can, please attach the CHM files you're
working with.

BR,
Oleg



On Sat, Oct 22, 2011 at 8:08 AM, Tran Nam Quang (Commented) (JIRA) 
j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133260#comment-13133260]

 Tran Nam Quang commented on TIKA-245:
 -

 @ Oleg
 I tested the CHM parser from Tika 0.10 on a few sample CHM files and found
 that many valid CHM entries are skipped. For comparison, I ran the same test
 with the chm4j library, which does _not_ skip these entries. Do you know
 about this problem?

  Support of CHM Format
  -
 
  Key: TIKA-245
  URL: https://issues.apache.org/jira/browse/TIKA-245
  Project: Tika
   Issue Type: New Feature
   Components: parser
  Environment: All
 Reporter: Karl Heinz Marbaise
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 0.10
 
  Attachments: TIKA-245.oleg.20110806.PATCH,
 TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt,
 TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
 
 
  It might be a good idea to support the CHM File format of Windows. Some
 information about
 http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
 The CHM format contains HTML files which can be parsed by Tika. So the
 only problem is to extract the data from the CHM file.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators:
 https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
 For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: location of pdfbox in sources of Tika

2011-10-31 Thread Oleg Tikhonov
Hi Ahmad,

I hope you built pdfbox with Maven, i.e. by running mvn clean install. If
so, the new pdfbox jar file is located in your local .m2 repository.
Next, find the pom.xml under ../tika-parsers and look for the
following:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.5.0</version>
</dependency>

Change the version tag to the appropriate number. Then go to
../tika-site (top-level directory of the Tika project) and rerun mvn clean
install.

If all went well, you will have a new Tika build.
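
After a rebuild like the one described above, one way to check which jar a class was actually loaded from is to ask the JVM for its code source. This is a minimal sketch; the class name JarLocator is hypothetical, and the default class name in main assumes that pdfbox is on the classpath.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from. */
class JarLocator {

    /** Returns the code source location of the class, or "unknown" if the JVM does not expose one. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for JDK core classes loaded by the bootstrap loader
            return "unknown";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults to a PDFBox class; pass another fully qualified name as args[0]
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

If the printed location is not the freshly built jar in the local .m2 repository, the old version is probably still being picked up.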

Hope it helps,

BR,
Oleg



On Mon, Oct 31, 2011 at 7:36 PM, ahmad ajiloo ahmad.aji...@gmail.com wrote:

 Hello
 I have an edited file in pdfbox project and want to rebuild Tika with
 this new file. But i can't find location of pdfbox sources in Tika
 sources to change that. can anyone help me?
 thanks



Re: location of pdfbox in sources of Tika

2011-11-01 Thread Oleg Tikhonov
You don't just change the dependency in the pom; you also rebuild Tika.
In your case:
1. After changing pdfbox, rebuild it
2. Change the dependency in the pom.xml
3. Rebuild Tika

If there were no API changes in pdfbox, it should work.


On Tue, Nov 1, 2011 at 12:32 PM, Ahmad Ajiloo ahmad.aji...@gmail.com wrote:

 thanks.
 But I don't think the problem is solved by changing the version of pdfbox in
 the dependencies if I want to change a single Java file in pdfbox. Is there
 any way to change only one file of pdfbox and rebuild Tika with the new
 one?

 On Mon, Oct 31, 2011 at 10:10 PM, Oleg Tikhonov o...@apache.org wrote:

  Hi Ahmad,
 
  I hope you built pdfbox using a maven, i.e. running mvn clean install. If
  so, a new pdfbox jar file is located in the .m2 local repository.
  In addition, please find a pom.xml under ../tika-parsers and change the
  following:
 
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.5.0</version>
  </dependency>
 
  Change also a version tag to the appropriate number. Then, go to
  ../tika-site (top level directory of tika  project) and rerun mvn clean
  install.
 
  If all were right you will have a new tika .
 
  Hope it helps,
 
  BR,
  Oleg
 
 
 
  On Mon, Oct 31, 2011 at 7:36 PM, ahmad ajiloo ahmad.aji...@gmail.com
  wrote:
 
   Hello
   I have an edited file in pdfbox project and want to rebuild Tika with
   this new file. But i can't find location of pdfbox sources in Tika
   sources to change that. can anyone help me?
   thanks
  




Re: [jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

2012-02-01 Thread Oleg Tikhonov
For Chinese we need to create/get two profiles: Chinese Traditional and
Chinese Simplified.

Oleg

On Thu, Feb 2, 2012 at 6:13 AM, James Sullivan (Commented) (JIRA) 
j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198521#comment-13198521]

 James Sullivan commented on TIKA-855:
 -

 If it is just a missing language profile issue let me know what is needed
 as at least for Japanese I am aware of number of large publicly available
 corpora that might be suitable and may be able to help generate the
 profiles. However, it sounds like there might be more to it than just
 generating the profile...I have added this as feature request TIKA-856.

  Language Detection not working for Japanese and Chinese.
  
 
  Key: TIKA-855
  URL: https://issues.apache.org/jira/browse/TIKA-855
  Project: Tika
   Issue Type: Bug
   Components: languageidentifier
 Affects Versions: 1.0
  Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun
 Java 6 and Oracle Java 7
 Reporter: James Sullivan
 Assignee: Ken Krugler
 Priority: Minor
   Labels: Chinese, Japanese
 
  I have tried Tika 1.0 language detection (java -jar tika.jar -l
 .\Japanese.txt) on several Japanese files (both PDF and text files) and it
 consistently returns lt (Lithuanian???) instead of ja. I also tried on a
 Chinese file which similarly incorrectly returned lt. Both English language
 and French language detection worked correctly.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators:
 https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
 For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-08 Thread Oleg Tikhonov
Here is my +1, this time tested only on Windows 7 x86-64 PE.

BR,
Oleg

On Thu, Mar 8, 2012 at 5:11 PM, Alex Ott alex...@gmail.com wrote:

 +1

 unpacked sources, compiled, tests passed. compiled tika-app works
 correctly.

 separately downloaded tika-app-1.1.jar also works correctly for me

 The small problem is that md5sum file for tika-app-1.1.jar isn't
 correctly formatted - file name is missing, so md5sum -c can't check
 it

 P.S. System, Debian Linux testing, JVM version 1.6.0_26

 On Wed, Mar 7, 2012 at 10:35 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
  Hi Folks,
 
  A candidate for the Tika 1.1 release is available at:
 
   http://people.apache.org/~mattmann/apache-tika-1.1/rc1/
 
  The release candidate is a zip archive of the sources in:
 
http://svn.apache.org/repos/asf/tika/tags/1.1/
 
  The SHA1 checksum of the archive is
 d3185bb22fa3c7318488838989aff0cc9ee025df.
 
  Please vote on releasing this package as Apache Tika 1.1.
  The vote is open for at least the next 72 hours and passes if a majority
 of at
  least three +1 Tika PMC votes are cast.
 
[ ] +1 Release this package as Apache Tika 1.1
[ ] -1 Do not release this package because...
 
  Thanks!
 
  Cheers,
  Chris
 
  P.S. Here's my +1.
 
  ++
  Chris Mattmann, Ph.D.
  Senior Computer Scientist
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 171-266B, Mailstop: 171-246
  Email: chris.a.mattm...@nasa.gov
  WWW:   http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Assistant Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 



 --
 With best wishes,Alex Ott
 http://alexott.net/
 Twitter: alexott_en (English), alexott (Russian)
 Skype: alex.ott



Re: [VOTE] Apache Tika 1.2 release rc #1

2012-07-11 Thread Oleg Tikhonov
Hi,

here is my +1.

Kind regards,
Oleg


On Thu, Jul 12, 2012 at 2:48 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

 Hi,

 On Wed, Jul 11, 2012 at 4:27 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
  On Jul 11, 2012, at 6:43 AM, Michael McCandless wrote:
  Why are there original-tika-app* files in the RC directory?
 
  Good question: this is the first time I've seen them in there too.

 I think they're coming from the shade plugin that we're now using
 instead of the bundle plugin to build the tika-app jar. I'll look at
 updating the release scripting to skip those files.

  Also, we used to name it apache-tika-1.1.src.* but now we dropped the
  apache- prefix?  Is that intentional?  (tika-app jar has never had the
  apache prefix I think...).
 
  Yeah good point -- I normally rename the artifact to apache-

 We can update the release script to do that automatically if we like.
 I originally set it up to use just the tika name since that's the
 pattern we've been following also in Jackrabbit, based originally on
 examples from HTTP Server and Lucene.

 BR,

 Jukka Zitting



Re: [VOTE] Graduate Apache Any23 from the Apache Incubator

2012-08-06 Thread Oleg Tikhonov
Hi Guys,

+1 for the graduation. Keep going !

KR,
Oleg


On Mon, Aug 6, 2012 at 11:44 PM, Dave Meikle loo...@gmail.com wrote:

 Hi,

 On 3 Aug 2012, at 18:50, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

  ...
  I'm now going to call for a community VOTE (before heading to the
 Incubator
  to make it official) for Any23 to graduate from the Incubator. VOTEs are
 open
  to Any23 and Tika community members and I'll leave the VOTE open for at
 least
  the next 72 hours before heading to the Incubator to formalize it.
 
  [ ] +1 Graduate Any23 from the Apache Incubator.
  [ ] +0 Don't care.
  [ ] -1  Don't graduate Any23 from the Apache Incubator because…
  …

 +1 for graduation from me.  Great job guys.

 Cheers,
 Dave




Re: [jira] [Commented] (TIKA-93) OCR support

2012-11-06 Thread Oleg Tikhonov
Hey,
I tried to look up the distribution but could not find the sources;
among the binaries they provide only a Nokia distribution.

It would be nice if you could play with it and share your impression(s).

BR,
Oleg


On Wed, Nov 7, 2012 at 2:52 AM, Pei Chen (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491996#comment-13491996]

 Pei Chen commented on TIKA-93:
 --

 Have you seen JavaOCR (pure java ocr and BSD licensed):
 http://sourceforge.net/projects/javaocr/
 I have not tried it out myself yet (looks like 1.0 was just released about
 1 week ago).
 I think a pure java implementation may be easier than forking another
 process (exec cpp) or introduce jni dependencies.
 If interested, I could give it a whirl the next chance I get...


  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Priority: Minor
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira



Re: [jira] [Commented] (TIKA-1041) Tika 1.2 universalcharset errors

2012-12-12 Thread Oleg Tikhonov
Hi David,

In the same directory where you run 'mvn clean install' (say, /home/tika/),
just run the following command:

mvn dependency:list

It will print out all the jars which a project depends on.

Hope it helps.



On Wed, Dec 12, 2012 at 3:35 PM, David Morana (JIRA) j...@apache.org wrote:

 Is there a central list somewhere?


Re: [jira] [Commented] (TIKA-1041) Tika 1.2 universalcharset errors

2012-12-12 Thread Oleg Tikhonov
David, does it fail on one particular file, or always, regardless of the
input?
POI hints at an illegal offset, which is probably the cause of the
error.

--Oleg



On Wed, Dec 12, 2012 at 4:31 PM, David Morana (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529984#comment-13529984]

 David Morana commented on TIKA-1041:
 

 After some research, I upgraded the POI jars to 3.9 (I was at v3.8 beta),
 but no luck: I'm still getting the error above.

  Tika 1.2 universalcharset errors
  
 
  Key: TIKA-1041
  URL: https://issues.apache.org/jira/browse/TIKA-1041
  Project: Tika
   Issue Type: Bug
 Affects Versions: 1.2
  Environment: I'm running solr 4.0 with tika 1.2 on tomcat 7.0.8
 with manifoldcf v1.1dev
 Reporter: David Morana
  Fix For: 1.2, 1.3
 
 
  This is somewhat confusing and frustrating. I successfully crawled
 Opentext using all of the above. then I recrawled and it aborted almost
 immediately.
  It choked on images, so I excluded them for now.
  but now it's choking on txt files!
  sometimes I get this error
  SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError:
 org/mozilla/universalchardet/CharsetListener
  and sometimes I get this one
  SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError:
 org/apache/tika/parser/txt/UniversalEncodingListener

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira
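
A NoClassDefFoundError like the ones quoted above usually means a jar is missing from the runtime classpath. A quick way to narrow that down is a small probe run with the same classpath as the failing application. This is only a sketch: the class name ClasspathCheck is hypothetical, and the two probed names are simply the ones from the error messages in the issue.

```java
/** Hypothetical helper: checks whether classes are visible on the current classpath. */
class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.mozilla.universalchardet.CharsetListener",
            "org.apache.tika.parser.txt.UniversalEncodingListener"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

In a servlet container such as Tomcat the web application has its own class loader, so the probe is only meaningful when it runs inside the same web application.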



Re: [jira] [Updated] (TIKA-1048) XMLParser should add whitespace between elements

2012-12-20 Thread Oleg Tikhonov
Hi Mike,

Maybe consider using UIMA (the rule engine)?

BR,
Oleg



On Thu, Dec 20, 2012 at 1:05 PM, Michael McCandless (JIRA)
j...@apache.org wrote:


  [
 https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Michael McCandless updated TIKA-1048:
 -

 Attachment: TIKA-1048.patch

 Patch w/ failing test ... I'm not sure where/how to best fix this yet ...

  XMLParser should add whitespace between elements
  
 
  Key: TIKA-1048
  URL: https://issues.apache.org/jira/browse/TIKA-1048
  Project: Tika
   Issue Type: Bug
   Components: parser
 Reporter: Michael McCandless
  Fix For: 1.3
 
  Attachments: TIKA-1048.patch
 
 
  If the incoming XML is compact (ie doesn't have whitespace between
 elements), I think we should somehow add whitespace between elements when
 extracting text?

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira
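
The whitespace problem described in the issue above can be illustrated with a small SAX handler that appends a space whenever an element ends. This is a minimal sketch using only the JDK's SAX API; the class name is hypothetical and this is not how Tika's XMLParser is implemented.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: extracts text from XML, separating elements with a space. */
class SpacedTextExtractor {

    static String extractText(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate the text of adjacent elements with a single space
                if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                    out.append(' ');
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Compact XML with no whitespace between the elements
        System.out.println(extractText("<a><b>hello</b><c>world</c></a>"));
    }
}
```

The trade-off is that inline markup inside a word would also be split by a space, so a real fix may need to be more selective.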



Re: [jira] [Commented] (TIKA-93) OCR support

2013-01-04 Thread Oleg Tikhonov
I've tried, without success. There is more to it than it seems. JavaOCR is
not an option in its current state. A temporary solution could be a wrapper
around Tesseract; however, making Tesseract work across multiple platforms
is still quite difficult.

Best regards,
Oleg




On Fri, Jan 4, 2013 at 3:46 PM, Maciej Lizewski (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543882#comment-13543882]

 Maciej Lizewski commented on TIKA-93:
 -

 Anything new on this topic? Has anyone tried that JavaOCR library with
 success? Does anybody have a working tika+ocr configuration?

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Priority: Minor
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira
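
The Tesseract wrapper idea mentioned in this thread could look roughly like the sketch below, which builds the command line and runs it as an external process. It assumes the tesseract binary is installed and on the PATH and uses the basic "tesseract image outputbase -l lang" invocation; the class and method names are hypothetical, and the platform-specific problems mentioned above are not addressed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper around the tesseract command-line tool. */
class TesseractCommandSketch {

    /** Builds the argument list: tesseract imagePath outputBase [-l language]. */
    static List<String> buildCommand(String imagePath, String outputBase, String language) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(imagePath);
        command.add(outputBase); // tesseract writes the text to outputBase + ".txt"
        if (language != null) {
            command.add("-l");
            command.add(language);
        }
        return command;
    }

    /** Runs tesseract and returns its exit code (assumes the binary is on the PATH). */
    static int run(String imagePath, String outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, outputBase, language))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: image file, args[1]: output base name
        System.out.println("exit code: " + run(args[0], args[1], "eng"));
    }
}
```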



Re: [jira] [Commented] (TIKA-93) OCR support

2013-01-14 Thread Oleg Tikhonov
From the DjVu (particular case) point of view, a possible flow could be as follows:
1. Extract images
2. For each image, extract text using OCR
2.1 Detect language
2.2 Detect font type
.

So language and font type may be used for providing metadata.
I think it should be as seamless as possible.

I'd also be interested to hear what you think/see/hope ...

Best regards,

Oleg



On Mon, Jan 14, 2013 at 10:58 PM, Pei Chen (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553107#comment-13553107]

 Pei Chen commented on TIKA-93:
 --

 I tried their javaocr-20100605 release with just ASCII scanned digits and
 it seems to have worked as advertised. It was fairly easy to use and set up.
 However, I noticed that their latest release has a lot of work geared
 towards Android development. I haven't had a chance to try integrating it
 with Tika yet.
 Are there any preferences on how it should flow in the context of Tika?

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Priority: Minor
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira



Re: [VOTE] Apache Tika 1.3 Release Candidate #1

2013-01-19 Thread Oleg Tikhonov
Hey Dave,

I could only test on Windows 7 x64. All tests passed
successfully!

[x] +1 Release this package as Apache Tika 1.3

BR,

Oleg


On Sat, Jan 19, 2013 at 6:30 AM, Dave Meikle loo...@gmail.com wrote:

 http://svn.apache.org/repos/asf/tika/tags/tika-1.3/


Re: [DISCUSS] Should Tika require Java6? (was Re: Build failed in Jenkins: Tika-trunk #977)

2013-02-08 Thread Oleg Tikhonov
Back to the future. Aha moment!!!
Here is my +1.

According to Oracle: "In February 2011 Oracle announced the End of Public
Updates for their Java SE 6 products for July 2012. In February 2012 Oracle
extended the End of Public Updates for 4 months, to November 2012."

Oleg



On Fri, Feb 8, 2013 at 6:54 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 Just to summarize, the question on the table is whether or not Tika should
 require Java6. We had some discussions on this previously (if I get time,
 will dig up the threads -- ok found time ;) ):

 https://issues.apache.org/jira/browse/TIKA-888

 http://mail-archives.apache.org/mod_mbox/tika-dev/201011.mbox/%3CC8F38B50.2
 3828%25chris.a.mattm...@jpl.nasa.gov%3E


 I'm +1 for it. Seems like so is Mike, and also Ken K. Any objections from
 others to require Java6?

 Cheers,
 Chris


 On 2/8/13 6:32 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

 
 On Feb 8, 2013, at 3:54am, Michael McCandless wrote:
 
  On Thu, Feb 7, 2013 at 3:51 PM, Nick Burch apa...@gagravarr.org
 wrote:
  On Thu, 7 Feb 2013, Michael McCandless wrote:
 
  Hmm it looks like the Tika build is failing on Jenkins due to this:
 
  [ERROR]
 
 /home/jenkins/jenkins-slave/workspace/Tika-trunk/trunk/tika-server/src/
 main/java/org/apache/tika/server/CSVMessageBodyWriter.java:[51,3]
  method does not override a method from its superclass
  [ERROR]
 
 /home/jenkins/jenkins-slave/workspace/Tika-trunk/trunk/tika-server/src/
 main/java/org/apache/tika/server/JSONMessageBodyWriter.java:[51,3]
  method does not override a method from its superclass
 
 
  Is that an @Override of a method from an Interface? That works on JDK
 1.6+,
  but isn't valid on JDK 1.5. Just remove the @Override from the
 interface
  implementing methods and you should be fine
 
  Ahh that's right.  I had forgotten about this.  I commented out
  those two @Overrides ...
 
  Maybe ... it's time for Tika to require Java 1.6?  Java 1.6 is end of
  life next month after all ...
 
 +1
 
 Seems like being one generation behind is OK, but not two :)
 
 -- Ken
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr
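
The @Override issue discussed in the quoted build failure above can be reproduced with a tiny example. The interface and class below are hypothetical stand-ins, not the actual tika-server classes: javac accepts the annotation on an interface method implementation at source level 1.6 and later, while at source level 1.5 it reports "method does not override a method from its superclass".

```java
/** Hypothetical interface standing in for a message body writer. */
interface BodyWriter {
    String write(String value);
}

/** Hypothetical implementation that quotes a value CSV-style. */
class CsvBodyWriter implements BodyWriter {

    // Fine at source level 1.6+; a compile error at source level 1.5,
    // where @Override is only allowed on methods overriding a superclass method.
    @Override
    public String write(String value) {
        // Double any embedded quotes and wrap the value in quotes
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(new CsvBodyWriter().write("plain text"));
    }
}
```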
 
 
 
 
 




Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2013-03-05 Thread Oleg Tikhonov
Tika's CHM support has its limitations. Can you provide such file(s) for
further investigation?

BR,
Oleg


On Wed, Mar 6, 2013 at 1:10 AM, Tejas Patil (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594074#comment-13594074]

 Tejas Patil commented on TIKA-245:
 --

 I am working on NUTCH-1454 and I am observing that tika is not able to
 extract contents from chm documents. (i tried with several chm files but it
 worked for none). Chm viewer however could show entire contents of the
 file. I am not the only guy who is facing this issue (see [here|
 http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-tp3999735p4001245.html
 ])

  Support of CHM Format
  -
 
  Key: TIKA-245
  URL: https://issues.apache.org/jira/browse/TIKA-245
  Project: Tika
   Issue Type: New Feature
   Components: parser
  Environment: All
 Reporter: Karl Heinz Marbaise
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 0.10
 
  Attachments: TIKA-245.oleg.20110806.PATCH,
 TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt,
 TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
 
 
  It might be a good idea to support the CHM File format of Windows. Some
 information about
 http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
 The CHM format contains HTML files which can be parsed by Tika. So the
 only problem is to extract the data from the CHM file.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira



Re: [VOTE] Apache TIka 1.4 Release Candidate #1

2013-06-16 Thread Oleg Tikhonov
In favor,

[x] +1 Release this package as Apache Tika 1.4.

Tested on Linux ubuntu 3.8.0-23-generic x64.

Maybe we have to update some dependencies.
Also ran a code coverage using mvn plugin, cobertura.

BR,
Oleg


Here is a link to the code coverage report & dependency updates (available
@dev).
https://drive.google.com/folderview?id=0B_DmgPkneiMgOFg2ZXBsOTZkRHc&usp=sharing



On Sun, Jun 16, 2013 at 6:52 AM, Chris Mattmann mattm...@apache.org wrote:

 Hi Guys,

 A candidate for the Tika 1.4 release is available at:

 http://people.apache.org/~mattmann/apache-tika-1.4/rc1/

 The release candidate is a zip archive of the sources in:

 http://svn.apache.org/repos/asf/tika/tags/1.4/


 The SHA1 checksum of the archive is
 1e523c6ed06b4d095d7f6e93a04a8d2ab43c7226.

 A staged M2 repository can also be found on repository.apache.org here:

 https://repository.apache.org/content/repositories/orgapachetika-020/


 Please vote on releasing this package as Apache Tika 1.4.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.4
 [ ] -1 Do not release this package because...

 Here is my +1 for the release.

 Cheers,
 Chris








Re: [VOTE] Apache TIka 1.4 Release Candidate #1

2013-06-16 Thread Oleg Tikhonov
I tried to send some comments about the release candidate but got a
delivery failure error. Am I off the list?

BR,
Oleg


On Sun, Jun 16, 2013 at 9:07 PM, Chris Mattmann mattm...@apache.org wrote:

 Ouch, just saw this. Oliver, I'm happy to commit the updated patch
 to the trunk but do you absolutely need this in 1.4 requiring me
 to spin up an RC #3?

 Cheers,
 Chris


 -Original Message-
 From: Oliver Heger oliver.he...@oliver-heger.de
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Sunday, June 16, 2013 10:25 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: [VOTE] Apache TIka 1.4 Release Candidate #1

 Am 16.06.2013 05:52, schrieb Chris Mattmann:
  Hi Guys,
 
  A candidate for the Tika 1.4 release is available at:
 
   http://people.apache.org/~mattmann/apache-tika-1.4/rc1/
 
  The release candidate is a zip archive of the sources in:
 
  http://svn.apache.org/repos/asf/tika/tags/1.4/
 
 
  The SHA1 checksum of the archive is
  1e523c6ed06b4d095d7f6e93a04a8d2ab43c7226.
 
  A staged M2 repository can also be found on repository.apache.org here:
 
  https://repository.apache.org/content/repositories/orgapachetika-020/
 
 
  Please vote on releasing this package as Apache Tika 1.4.
  The vote is open for the next 72 hours and passes if a majority of at
  least three +1 Tika PMC votes are cast.
 
   [ ] +1 Release this package as Apache Tika 1.4
   [ ] -1 Do not release this package because...
 
  Here is my +1 for the release.
 
  Cheers,
  Chris
 
 
 There is a minor issue with TIKA-991: The original patch had been
 applied, but in the meantime I discovered that the code could enter an
 infinite loop under certain circumstances. Therefore, I provided a
 second patch (the small attachment from Feb 15th). Could this patch be
 applied, too, before the release?
 
 Thanks
 Oliver
 





Re: [VOTE] Apache TIka 1.4 Release Candidate #2

2013-06-17 Thread Oleg Tikhonov
Hey,
All tests passed on the following platforms:
1. Linux ubuntu 3.8.0-25-generic x86_64 Ubuntu 13.04
2. Microsoft Windows 7 Enterprise, x64-based PC

Please have a look:
https://drive.google.com/?tab=mo&authuser=0#folders/0B_DmgPkneiMgOFg2ZXBsOTZkRHc
There are two files: one contains a list of dependency updates, and the
other is a code coverage report.

+1 for release 1.4-rc2

Cheers,
Oleg


On Tue, Jun 18, 2013 at 7:46 AM, Chris Mattmann mattm...@apache.org wrote:

 Hey Guys,

 Just FYI on this, the VOTE is still going if folks have a
 chance to review, would appreciate it. So far, we've got
 1 binding +1. :)

 Cheers,
 Chris



 -Original Message-
 From: jpluser mattm...@apache.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Sunday, June 16, 2013 11:06 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Cc: u...@tika.apache.org u...@tika.apache.org
 Subject: [VOTE] Apache TIka 1.4 Release Candidate #2

 Hi Guys,
 
 A second candidate for the Tika 1.4 release is available at:
 
 http://people.apache.org/~mattmann/apache-tika-1.4/rc2/
 
 The release candidate is a zip archive of the sources in:
 
 http://svn.apache.org/repos/asf/tika/tags/1.4-rc2/
 
 The SHA1 checksum of the archive is
 84ce9ebc104ca348a3cd8e95ec31a96169548c13
 
 A staged M2 repository can also be found on repository.apache.org here:
 
 https://repository.apache.org/content/repositories/orgapachetika-022/
 
 
 Please vote on releasing this package as Apache Tika 1.4.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.
 
 [ ] +1 Release this package as Apache Tika 1.4
 [ ] -1 Do not release this package because...
 
 Here is my +1 for the release.
 
 Cheers,
 Chris
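The checksum step in the vote mail above can be sketched in Java. This is a minimal illustration using only the JDK; the class name `Sha1Check` and its methods are hypothetical and are not part of Tika or its release tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes a SHA-1 digest and compares it to a published value.
final class Sha1Check {

    // Returns the SHA-1 digest of the given bytes as lower-case hex.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on standard JDKs, so this is unexpected.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Compares the computed digest with the value copied from the vote mail.
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Sha1Check <archive> <expected-sha1>");
            return;
        }
        // Reads the whole archive into memory, which is fine for a sketch.
        byte[] archive = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(archive, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

A reviewer would pass the downloaded source zip and the checksum quoted in the mail as the two arguments.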
 
 
 
 
 
 
 
 
 
 
 





Re: [jira] [Updated] (TIKA-1152) Process stucks on parsing of a CHM file

2013-07-23 Thread Oleg Tikhonov
Hi, can you attach the problematic file ?
Thanks.


On Tue, Jul 23, 2013 at 4:46 PM, Hong-Thai Nguyen (JIRA) j...@apache.org wrote:


  [
 https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Hong-Thai Nguyen updated TIKA-1152:
 ---

 Attachment: eventcombmt.chm

  Process stucks on parsing of a CHM file
  ---
 
  Key: TIKA-1152
  URL: https://issues.apache.org/jira/browse/TIKA-1152
  Project: Tika
   Issue Type: Bug
   Components: parser
 Affects Versions: 1.4
  Environment: Windows/Linux
 Reporter: Hong-Thai Nguyen
 Priority: Critical
  Fix For: 1.5
 
  Attachments: eventcombmt.chm
 
 
  When parsing the attached CHM file (Microsoft Help file), the Java
  process hangs.
  {code}
  Thread[main,5,main]
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
  org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
  org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
  org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
  org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
  org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
  com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
  com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:114)
  com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:77)
  com.polyspot.wscrawlers.Converter.getConvertedDocument(Converter.java:81)
  com.polyspot.wscrawlers.AbstractConverter.getDirectConvertedDocument(AbstractConverter.java:139)
  com.polyspot.connector.framework.convert.PES5ConversionService.convert(PES5ConversionService.java:43)
  com.polyspot.connector.framework.convert.ConversionService.findDocumentSplitterAndCallConvert(ConversionService.java:362)
  com.polyspot.connector.framework.convert.ConversionService.convertAndGenerateThumbnailForMasterFile(ConversionService.java:291)
  com.polyspot.connector.framework.processors.ConvertAndMergeMasterFile.process(ConvertAndMergeMasterFile.java:40)
  com.polyspot.connector.framework.processors.SequenceDocumentProcessor.process(SequenceDocumentProcessor.java:21)
  com.polyspot.connector.framework.plugins.DocumentBuilderPlugin.computeDocument(DocumentBuilderPlugin.java:48)
  com.polyspot.connector.framework.plugins.PluginsManager.computeDocument(PluginsManager.java:219)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processOutOfDateNode(Orchestrator.java:201)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:172)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
  com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
  com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
  com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
  com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
  com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeDocumentMetadata.synchronizeAllChildren(KnowledgeTreeDocumentMetadata.java:98)
  com.polyspot.connector.framework.orchestrators.Orchestrator.syncChildren(Orchestrator.java:311)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:177)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
  com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
  com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
  com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
  com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
  com.polyspot.connector.knowledgetree.driver.db.DBKnowledgeTreeDriver.executeAllDocuments(DBKnowledgeTreeDriver.java:71)
  com.polyspot.connector.knowledgetree.driver.KnowledgeTreeDriver.executeAllDocuments(KnowledgeTreeDriver.java:107)
 
 

Re: [jira] [Comment Edited] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-29 Thread Oleg Tikhonov
Thanks !

BR,
Oleg


On Mon, Jul 29, 2013 at 4:47 PM, Hong-Thai Nguyen (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716538#comment-13716538]

 Hong-Thai Nguyen edited comment on TIKA-1152 at 7/29/13 1:46 PM:
 -

 It's a bug in ChmLzxBlock.java that is triggered by this faulty file: the
 code never leaves the loop when the block type does not match.

   was (Author: thaichat04):
 It's a bug on ChmLzxBlock.java on this fautly file. It leave never
 loop when block type does not match.
 I'm ready to push a fix, but don't have ASF account on Tika project.
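The failure mode described in the comment (a loop that never exits when the block type is not recognized) can be illustrated with a simplified sketch. The `BlockLoop` class and its constants below are hypothetical; they do not reproduce the real ChmLzxBlock code or the attached patch, and only show one way to make such a loop fail fast instead of spinning.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified decode loop; not the actual Tika CHM parser.
final class BlockLoop {
    static final int VERBATIM = 1;
    static final int ALIGNED_OFFSET = 2;
    static final int UNCOMPRESSED = 3;

    // Walks over a sequence of block types and records how each was handled.
    static List<String> decode(int[] blockTypes) {
        List<String> decoded = new ArrayList<>();
        int pos = 0;
        while (pos < blockTypes.length) {
            int type = blockTypes[pos];
            switch (type) {
                case VERBATIM:
                    decoded.add("verbatim");
                    pos++;
                    break;
                case ALIGNED_OFFSET:
                    decoded.add("aligned");
                    pos++;
                    break;
                case UNCOMPRESSED:
                    decoded.add("uncompressed");
                    pos++;
                    break;
                default:
                    // Without this branch pos would never advance for an unknown
                    // type, and the loop would run forever on a corrupt file.
                    throw new IllegalStateException(
                            "Unknown block type " + type + " at index " + pos);
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        System.out.println(decode(new int[] {VERBATIM, UNCOMPRESSED, ALIGNED_OFFSET}));
    }
}
```

Throwing an exception lets the caller report the file as unparseable rather than leaving the thread stuck.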

  Process loops infinitely on parsing of a CHM file
  -
 
  Key: TIKA-1152
  URL: https://issues.apache.org/jira/browse/TIKA-1152
  Project: Tika
   Issue Type: Bug
   Components: parser
 Affects Versions: 1.4
  Environment: Windows/Linux
 Reporter: Hong-Thai Nguyen
 Priority: Critical
  Fix For: 1.5
 
  Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
 
 
  When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help
  file), the Java process hangs.
  {code}
  Thread[main,5,main]
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
  org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
  org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
  org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
  org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
  org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
  com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
  ...
  {code}

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira



Re: Apache Tika for Android

2013-08-29 Thread Oleg Tikhonov
Hi, Vasily,
Welcome aboard !
Just keep in mind that Tika is written in Java, so it can run on any
compatible JVM.
To get started, please refer to: http://tika.apache.org/1.4/gettingstarted.html
In general, Tika can extract text from most known file types, including PDF.
Apache Tika is licensed under the Apache License, Version 2.0; see
http://www.apache.org/licenses/LICENSE-2.0.
Yes, it can be used commercially (development & sale).

Hope it helps.





On Thu, Aug 29, 2013 at 4:55 PM, Василий Саржинский 
vasiliy.sarzhins...@mail.ru wrote:


 Hello!

 Can Apache Tika extract text from a PDF file into a txt file on Android?
 Is Apache Tika free for commercial use (for development)?
 Thanks!


 With Best Regards,
 Vasiliy Sarzhinskiy


Re: Apache tika installation issue

2013-09-27 Thread Oleg Tikhonov
Hi,

If you meant how to import the Tika project, here are the steps:

1. In Eclipse, choose File -> Import...
2. Choose Existing Maven Projects and click Next.
3. Click Browse and point to the Tika project, say tika-core.
4. Then click Finish.

That's it.

Hope it helps.

BR,
Oleg





On Fri, Sep 27, 2013 at 9:48 AM, Mattmann, Chris A (398J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Dear Sudheer,

 Did you receive a reply to your question?

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: sudheer y sudhe...@datahubsoftware.in
 Date: Tuesday, September 17, 2013 12:02 AM
 To: dev-ow...@tika.apache.org dev-ow...@tika.apache.org
 Subject: Apache tika installation issue

 Dear Experts,
 
 
 Can you give a step-by-step guide to installing Apache Tika in Eclipse
 using Maven on Windows?
 
 
 
 --
 Thanks & Best Regards,
 Sudheer Kumar Y
 
 Software Engineer
 
 DATAHUB SOFTWARE INDIA PVT LTD. | MAKING IT POSSIBLE
 Mobile: +91 8143161684
 
 Email: sudhe...@datahubsoftware.in
 WEB : www.datahubsoftware.com http://www.datahubsoftware.com
 
 
 
 
 
 
 
 
 
 




Re: Having Problem in Word Count and Language Detaction

2013-10-26 Thread Oleg Tikhonov
Hi Animesh,
My guess is that the N-gram profile for Chinese wasn't trained very well.
Try recreating the Chinese language profile.

Have a look here:
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html

Hope it helps.
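To make the N-gram idea concrete, here is a minimal sketch of building a character trigram profile from sample text, using only the JDK. It is not Tika's language profile implementation: the class name `NGramProfile`, the choice of trigrams, and the underscore padding are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a character trigram profile; not Tika's own code.
final class NGramProfile {

    // Counts every 3-character window in the normalized text.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Lower-case, then mark word boundaries (and the text edges) with '_'.
        String normalized = "_"
                + text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", "_")
                + "_";
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A tiny training text gives a sparse profile; real profiles are
        // trained on far more text per language.
        System.out.println(trigramCounts("ab ab"));
    }
}
```

A profile built from too little text, or from text that is not representative, has few reliable trigram counts, which is one plausible reason for poor results on a given language.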


On Sat, Oct 26, 2013 at 8:48 PM, Chris Mattmann mattm...@apache.org wrote:

 Hi Animesh,

 Please detail your issue here on dev@tika.apache.org and I'm sure
 someone can help.

 Cheers,
 Chris


 -Original Message-
 From: Animesh Kumar animesh.sa...@gmail.com
 Date: Wednesday, October 23, 2013 9:15 PM
 To: dev-ow...@tika.apache.org dev-ow...@tika.apache.org
 Subject: Fwd: Having Problem in Word Count and Language Detaction

 
 
 Sir/Madam,
 I am developing web-based software that uses Apache Tika to get the
 language and word count of an uploaded file. It works fine for English,
 Japanese, Hindi, etc., but gives a wrong word count for Chinese. I am using
 tika-app-1.4.jar.
 There is also another problem with word counting for file formats other
 than doc and docx.
 
 
 --
 With Thanks & Regards
 Animesh Kumar
 +918927992397 tel:%2B918927992397
 
 
 
 
 
 
 
 
 





Re: Having Problem in Word Count and Language Detaction

2013-10-26 Thread Oleg Tikhonov
This one is better
https://issues.apache.org/jira/browse/TIKA-546









Re: NonSequentialPDFParser

2013-12-02 Thread Oleg Tikhonov
I think we must. +1 for this improvement.

BR,
Oleg


On Mon, Dec 2, 2013 at 4:17 PM, Hong-Thai Nguyen 
hong-thai.ngu...@polyspot.com wrote:

 Hi all,
 NonSequentialPDFParser may improve PDF parsing performance by 45%.
 Should we integrate it into Tika?
 https://issues.apache.org/jira/browse/PDFBOX-1104

 Thanks,

 Hong-Thai




Re: Switch to JUnit 4.x?

2013-12-14 Thread Oleg Tikhonov
Hi Ken,
not at all. +1 - go for it!


BR,
Oleg


On Sun, Dec 15, 2013 at 1:39 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

 Hi all,

 See https://issues.apache.org/jira/browse/TIKA-1209

 Any objections to switching to JUnit 4.11?

 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr








Re: [jira] [Commented] (TIKA-93) OCR support

2013-12-24 Thread Oleg Tikhonov
Hi Frank,

It's not so easy, especially with a dependency on native libraries.
It also depends on trained profiles, languages & fonts.

The questions are: which platforms do we want to support, and which
languages and fonts?

BR,
Oleg


On Tue, Dec 24, 2013 at 9:48 AM, frank (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856214#comment-13856214]

 frank commented on TIKA-93:
 ---

 this feature is really useful and helpful.

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Priority: Minor
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.



 --
 This message was sent by Atlassian JIRA
 (v6.1.5#6160)



Re: [VOTE] Apache Tika 1.5 RC1

2014-02-04 Thread Oleg Tikhonov
Hi David,
 [x] +1 Release this package as Apache Tika 1.5

Thanks!
BR,
Oleg


On Wed, Feb 5, 2014 at 3:59 AM, David Meikle loo...@gmail.com wrote:

 Hi Guys,

 A candidate for the Tika 1.5 release is now available at:
 http://people.apache.org/~dmeikle/tika-1.5-rc1/

 The release candidate is a zip archive of the sources in:
 http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/

 The SHA1 checksum of the archive is:
 66adb7e73058da73a055a823bd61af48129c1179

 A staged M2 repository can also be found on repository.apache.org here:
 https://repository.apache.org/content/repositories/orgapachetika-1000

 Please vote on releasing this package as Apache Tika 1.5.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.5
[ ] -1 Do not release this package because...

 Here is my +1 for the release.

 Cheers,
 Dave
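The voting rule quoted in the mail above can be written down as a small check. This is one reading of the rule (at least three binding +1 votes, and more binding +1 than binding -1 votes); the class name `ReleaseVote` is hypothetical.

```java
// Hypothetical sketch of the release vote rule as stated in the vote mail.
final class ReleaseVote {

    // Passes when at least three binding +1 votes are cast and they
    // outnumber the binding -1 votes.
    static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        System.out.println(passes(3, 0)); // three binding +1, no -1
        System.out.println(passes(1, 0)); // too few binding votes so far
    }
}
```

Non-binding votes, such as those from community members who are not on the PMC, are not counted by this check.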


Re: [jira] [Commented] (TIKA-93) OCR support

2014-02-08 Thread Oleg Tikhonov
Hi Grant,
what you're doing seems great.
I've checked Tess4J (http://tess4j.sourceforge.net/): it is released and
distributed under the Apache License, v2.0
(http://www.apache.org/licenses/LICENSE-2.0.html).

Hope it helps.

BR,
Oleg



On Sat, Feb 8, 2014 at 1:14 PM, Grant Ingersoll (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895514#comment-13895514]

 Grant Ingersoll commented on TIKA-93:
 -

 It can, via some ancient JavaIO stuff, which, in some cases, has some
 weird dependencies.  Still working this out, but the way this is shaping up
 is that it is all going to have to be very pluggable to avoid any of these
 cases.  If anyone is up for lobbying the Tess4J team to remove
 GPL/LGPL/viral dependencies, we'd be in much better shape.

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Priority: Minor
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.



 --
 This message was sent by Atlassian JIRA
 (v6.1.5#6160)



Re: [jira] [Commented] (TIKA-93) OCR support

2014-02-08 Thread Oleg Tikhonov
Hi,
There is another code coverage Maven plug-in, called Cobertura.
If you run *mvn clean install cobertura:cobertura*, there is no need to put
it in the pom.

Hope it helps.




On Sat, Feb 8, 2014 at 10:17 PM, Grant Ingersoll (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895718#comment-13895718]

 Grant Ingersoll commented on TIKA-93:
 -

 bq. what is the dependency on jacoco in tika-parent? That stuff seems
 orthogonal to the patch.

 I put that in so that I can measure whether I am testing sufficiently.  I
 can separate it out to a different patch.

 bq. dependency on custom external Maven repo – myGrid – any way to get the
 jar from the Central repo somewhere? we have made an effort in Tika to
 remove any specific deps on external repositories

 We could make that one optional.  All it does is add support for TIFF and
 a few other file formats that aren't part of the standard ImageIO.

 bq.  in my CS572 class on Search Engines where we look at FBI Vault PDF
 files!  http://www-scf.usc.edu/~csci572/

 I read your abstract for your talk and checked out the Vault and thought
 it would be cool, too.  The main issue is that JavaOCR needs to be trained
 in order to work with that data set.  Tesseract, on the other hand, works
 for it, but alas, needs to be implemented as an OCRParser.  Since Tess4J
 has some bad deps, the only way I could see to do this is to exec the
 process or go write my own JNI integration for Tesseract.  The latter isn't
 likely to happen.  The former feels less than desirable, but would work.
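The "exec the process" option mentioned above could look roughly like this. It is a minimal sketch, assuming a `tesseract` binary is on the PATH and accepts the usual `tesseract <image> <outputbase> -l <lang>` invocation, which writes `<outputbase>.txt`. The class name `TesseractExec` is hypothetical and this is not the code in the attached patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of running the Tesseract command line tool from Java.
final class TesseractExec {

    // Builds the command line; kept separate so it can be inspected or logged.
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // Runs tesseract on one image and returns the recognized text.
    static String ocr(Path image, String lang, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        try {
            // inheritIO avoids blocking on an unread output pipe.
            Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                    .inheritIO()
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with " + process.exitValue());
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

The timeout matters because an external process can hang on a bad input, and the caller should not hang with it.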

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Assignee: Chris A. Mattmann
 Priority: Minor
  Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
 
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.



 --
 This message was sent by Atlassian JIRA
 (v6.1.5#6160)



Re: [jira] [Commented] (TIKA-93) OCR support

2014-02-10 Thread Oleg Tikhonov
@Timo,
On the other hand, this parser can serve as a composite for more
complicated parsers.
For example, for DjVu you can extract the images, parse them one by one,
and then simply append the extracted text.
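The composite idea above (extract the images, OCR each one, append the results) can be sketched with plain JDK types. The class name `CompositeOcr` is hypothetical, and the OCR step is passed in as a function because the actual OCR parser API is not specified here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: applies an OCR function to each extracted image
// and joins the recognized text in page order.
final class CompositeOcr {

    static String extractAll(List<byte[]> images, Function<byte[], String> ocr) {
        StringBuilder text = new StringBuilder();
        for (byte[] image : images) {
            String pageText = ocr.apply(image);
            if (pageText == null || pageText.isEmpty()) {
                continue; // nothing recognized on this image
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(pageText);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a real OCR engine: treats the bytes as UTF-8 text.
        Function<byte[], String> fakeOcr = b -> new String(b, StandardCharsets.UTF_8);
        List<byte[]> images = Arrays.asList(
                "first page".getBytes(StandardCharsets.UTF_8),
                "second page".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractAll(images, fakeOcr));
    }
}
```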


BR,
Oleg


On Mon, Feb 10, 2014 at 11:09 AM, Timo Boehme (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896339#comment-13896339]

 Timo Boehme commented on TIKA-93:
 -

 I would like to give some comments on detecting/handling of image-based
 PDFs, because the proposed solution will only work with a subset of these
 kinds of documents. First, one could classify image-based PDFs into 3
 classes:
 # image only (one image per page)
 # image with text overlay/underlay already produced by an OCR process
 # multiple images per page (instead of one full page image there are
 images per word/line/paragraph)

 Thus, from only testing for a page-size image one does not know whether we
 nevertheless have parseable text or whether we have a class 3 document (in
 the case of e.g. journals we might even have a full-page background image).
 For an automatic classification one would first need to try to parse text in
 the standard way for a few pages. One should not expect image-only PDFs to
 contain no text: in some cases headers, footers, and page numbers are added
 as text whereas other content is only an image. A heuristic threshold is
 60-80 characters per page, below which we can assume we have an image PDF.
 If a PDF is assumed to be an image PDF, the pages should be 'printed' into
 an image (in order to also handle class 3 documents and to keep mixed data
 (image + text)), and this image should be processed by OCR.
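The threshold heuristic described above can be sketched as follows. This is only an illustration: the class name `ImagePdfHeuristic`, the sampling of the first few pages, and the default of 70 characters (a midpoint of the 60-80 range mentioned) are assumptions, and the per-page text is assumed to come from a normal text extraction pass.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the characters-per-page heuristic for image PDFs.
final class ImagePdfHeuristic {

    // Returns true when the sampled pages average fewer non-whitespace
    // characters than the threshold, suggesting an image-based PDF.
    static boolean looksImageBased(List<String> pageTexts, int samplePages,
                                   int minCharsPerPage) {
        int pages = Math.min(samplePages, pageTexts.size());
        if (pages == 0) {
            return false; // no pages sampled, so no evidence either way
        }
        long chars = 0;
        for (int i = 0; i < pages; i++) {
            chars += pageTexts.get(i).chars()
                    .filter(c -> !Character.isWhitespace(c))
                    .count();
        }
        return (double) chars / pages < minCharsPerPage;
    }

    public static void main(String[] args) {
        // Pages that only carry a page number as text, as in the comment above.
        List<String> pages = Arrays.asList("Page 1", "Page 2", "Page 3");
        System.out.println(looksImageBased(pages, 5, 70));
    }
}
```

Pages flagged this way would then be rendered to images and passed to OCR, as the comment suggests.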

 Best,
 Timo

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Assignee: Chris A. Mattmann
 Priority: Minor
  Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
 TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
 
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.



 --
 This message was sent by Atlassian JIRA
 (v6.1.5#6160)



Re: Searching for Tika Jira issues using Lucene

2014-03-05 Thread Oleg Tikhonov
Hi Mike!
Sounds great! Thanks.

Oleg


On Wed, Mar 5, 2014 at 6:47 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Team,

 If you want to search for Tika Jira issues, I just added Tika coverage
 into the Lucene dog food server we use for finding Lucene/Solr
 issues at http://jirasearch.mikemccandless.com.

 I just posted a blog post describing recent changes:


 http://blog.mikemccandless.com/2014/03/using-lucenes-search-server-to-search.html

 Basically I started this as an effort to test Lucene's functionality
 in a real application/server (searching for issues), and to eat our
 own dog food, but then over time I think it's proven quite useful
 and I now use it almost exclusively when I need to find a Lucene issue.

 Compared to Jira's builtin search, it's more full text like; e.g.,
 makes suggestions as you type, produces snippets and highlights, ranks
 by blended relevance+recency, etc.  It has facets so you can quickly
 drill down/sideways by various metadata.  In the results, you can
 click on a snippet to go straight to the specific comment and issue
 that it came from.

 It uses Lucene's near-real-time indexing + searching, so issue updates
 should be visible within ~ 30 seconds or so.

 I hope you find it useful too!

 Mike McCandless

 http://blog.mikemccandless.com



Re: [jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Oleg Tikhonov
Hi Rupert,
agree about
javax.servlet;resolution:=optional,
javax.servlet.http;resolution:=optional,

Will check it out tomorrow.

Thanks !!!


On Mon, Apr 28, 2014 at 4:44 PM, Rupert Westenthaler (JIRA) j...@apache.org
 wrote:


  [
 https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Rupert Westenthaler updated TIKA-1276:
 --

 Attachment: TIKA-1276_20140428_2_rwesten.diff

 Attached a revised patch (TIKA-1276_20140428_2_rwesten.diff) that makes
 the `javax.servlet` API an optional dependency

  Missing embedded dependencies in tika-bundle
  
 
  Key: TIKA-1276
  URL: https://issues.apache.org/jira/browse/TIKA-1276
  Project: Tika
   Issue Type: Bug
   Components: packaging
 Affects Versions: 1.5
  Environment: OSGI, Apache Felix via Apache Sling Launcher
 Reporter: Rupert Westenthaler
  Fix For: 1.6
 
  Attachments: TIKA-1276_20140423_rwesten.diff,
 TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_rwesten.diff
 
 
  While updating from Tika 1.2 to 1.5 I noticed that the
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
  1. `com.uwyn:jhighlight:1.0` is not embedded
  Because of that installing the bundle results in the following exception
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement
 [103.0] osgi.wiring.package;
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement
 [103.0] osgi.wiring.package;
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
at
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at
 org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
  {code}
  2. `org.ow2.asm:asm:4.1` is not embedded because
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and
 therefore the `Embed-Dependency` directive `asm` does not match any
 dependency.
  Because of that one gets the following exception (after fixing (1))
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package;
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package;
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
at
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at
 org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
  {code}
  There are two possibilities to fix this: (a) change the
 `Embed-Dependency` to `asm-debug-all`, or (b) add a dependency on
 `org.ow2.asm:asm:4.1` to the tika-bundle pom file.
  3. `edu.ucar:netcdf:4.2-min` is not embedded
  Because of that one does get the following exception (after fixing (1)
 and (2))
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
at
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at
 org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
  {code}
  4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
  After fixing the above issues the tika-bundle was started successfully.
 However, when extracting EXIF metadata from a JPEG image I got the following
 exception.
  {code}
  java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
at
 

Re: [jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-29 Thread Oleg Tikhonov
No problem. Will test it.


On Tue, Apr 29, 2014 at 3:43 PM, Rupert Westenthaler (JIRA) j...@apache.org
 wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984251#comment-13984251]

 Rupert Westenthaler commented on TIKA-1276:
 ---

 Personally I am happy with having Tika running within an OSGi environment.
 AFAIK Tika uses the Java ServiceLoader for locating different parsers.
 ServiceLoader can only locate services within the same bundle. So the fact
 that the ClassLoader of the tika-bundle is passed to the TikaConfig by the
 Activator is the reason why the current solution works.

 As soon as one would like to have different parsers in different bundles
 one would need to also provide an alternative implementation to the
 DefaultParser for OSGI. Implementing a CompositeParser that tracks
 available parsers via the OSGI service registry would for sure be the way
 to go.

 To register single parsers as OSGi services one would only need to add
 @Component and @Service annotations. As these are not runtime annotations,
 this would not even add any runtime dependencies.

 Implementing it that way would even allow to dynamically add/remove
 parsers at runtime. Components with a dependency to the CompositeParser
 could even keep using their service object.
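The idea of a composite that tracks parsers added and removed at runtime can be sketched without any OSGi API. The `ParserRegistry` class below is hypothetical and uses plain functions in place of Tika's Parser interface; in a real bundle, the register and unregister calls would presumably be driven by service registry callbacks.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a composite whose delegates can change at runtime.
final class ParserRegistry {
    // Media type -> parser; a concurrent map lets parsers come and go while
    // other threads keep parsing.
    private final Map<String, Function<byte[], String>> parsers =
            new ConcurrentHashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    void unregister(String mediaType) {
        parsers.remove(mediaType);
    }

    // Delegates to the parser currently registered for the media type, if any.
    Optional<String> parse(String mediaType, byte[] content) {
        Function<byte[], String> parser = parsers.get(mediaType);
        return parser == null ? Optional.empty() : Optional.of(parser.apply(content));
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", b -> new String(b, StandardCharsets.UTF_8));
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/plain", doc).orElse("<no parser>"));
        registry.unregister("text/plain");
        System.out.println(registry.parse("text/plain", doc).orElse("<no parser>"));
    }
}
```

Callers hold on to the registry object itself, so they keep working as individual parsers are added or removed.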

  Missing embedded dependencies in tika-bundle
  
 
  Key: TIKA-1276
  URL: https://issues.apache.org/jira/browse/TIKA-1276
  Project: Tika
   Issue Type: Bug
   Components: packaging
 Affects Versions: 1.5
  Environment: OSGI, Apache Felix via Apache Sling Launcher
 Reporter: Rupert Westenthaler
  Fix For: 1.6
 
  Attachments: TIKA-1276_20140423_rwesten.diff,
 TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff,
 TIKA-1276_20140428_rwesten.diff
 
 
  While updating from Tika 1.2 to 1.5 I noticed that the
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
  1. `com.uwyn:jhighlight:1.0` is not embedded
  Because of that installing the bundle results in the following exception
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement
 [103.0] osgi.wiring.package;
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement
 [103.0] osgi.wiring.package;
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
      at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
      at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
      at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
      at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
      at java.lang.Thread.run(Thread.java:744)
  {code}
  2. `org.ow2.asm:asm:4.1` is not embedded because
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and
 therefore the `Embed-Dependency` directive `asm` does not match any
 dependency.
  Because of that one gets the following exception (after fixing (1))
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package;
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package;
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
      at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
      at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
      at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
      at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
      at java.lang.Thread.run(Thread.java:744)
  {code}
  There are two possibilities to fix this: (a) change the
 `Embed-Dependency` to `asm-debug-all`, or (b) add a dependency on
 `org.ow2.asm:asm:4.1` to the tika-bundle pom file.
  3. `edu.ucar:netcdf:4.2-min` is not embedded
  Because of that one does get the following exception (after fixing (1)
 and (2))
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] 

Re: [jira] [Commented] (TIKA-93) OCR support

2014-05-29 Thread Oleg Tikhonov
Guys,
Tesseract is itself a project written in C/C++ and has to be
compiled separately for each platform.
Personally, I would make an installed Tesseract a requirement for those who
want to work with it. I am not sure that putting Tesseract in the sources is
the right way to go.

How good Tesseract is depends at least on the trained data and on the quality
of the input images. There is no simple answer.

BR,
Oleg


On Thu, May 29, 2014 at 11:07 PM, Luis Filipe Nassif (JIRA) j...@apache.org
 wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012810#comment-14012810]

 Luis Filipe Nassif commented on TIKA-93:
 

 Thank you very much [~tpalsulich] for including unit tests! We could also
 include tests for normal images (not embedded).

 There is a simple timeout control that throws a TikaException with
 specific message if it happens. The idea to force setting a
 TesseractOCRConfig object in parseContext to run OCR is to not affect users
 that do not want OCR, exactly because it could take seconds, even minutes.
 So TesseractOCRParser can be included in Tika Parser list by default with
 no problem. We also could include a warning about OCR slowness in the class
 description.
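
 The timeout control described above can be sketched in plain Java. This is
 only an illustration of the general technique, not the code from the patch:
 TimeoutRunner and runWithTimeout are hypothetical names, and an
 IllegalStateException stands in for the TikaException mentioned in the
 comment.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutRunner {
    // Runs the task on a worker thread and returns its result, or throws an
    // IllegalStateException with a specific message if the task does not
    // finish within timeoutMillis.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow task
            throw new IllegalStateException(
                    "Task timed out after " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for task", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow(); // never leave the worker thread behind
        }
    }

    public static void main(String[] args) {
        String fast = runWithTimeout(() -> "fast result", 1000);
        System.out.println(fast);
        try {
            // Simulates a slow OCR run that exceeds the allowed time.
            runWithTimeout(() -> {
                Thread.sleep(5000);
                return "slow result";
            }, 100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```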

 I have no idea how to include Tesseract in the sources. Maybe Tika
 committers can help with this?

  OCR support
  ---
 
  Key: TIKA-93
  URL: https://issues.apache.org/jira/browse/TIKA-93
  Project: Tika
   Issue Type: New Feature
   Components: parser
 Reporter: Jukka Zitting
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 1.6
 
  Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
 TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
 
 
  I don't know of any decent open source pure Java OCR libraries, but
 there are command line OCR tools like Tesseract (
 http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
 extract text content (where available) from image files.



 --
 This message was sent by Atlassian JIRA
 (v6.2#6252)



Re: Stack Overflow Question

2014-06-30 Thread Oleg Tikhonov
Hi,
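Please have a look at the code below:
[code]
// Needs org.apache.tika.parser.{Parser, AutoDetectParser, ParseContext},
// org.apache.tika.metadata.Metadata, org.apache.tika.sax.BodyContentHandler,
// org.xml.sax.ContentHandler and java.io.InputStream.
Parser parser = new AutoDetectParser(); // Should auto-detect!
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

// Register the parser in the context so that the entries inside the zip
// are parsed recursively as well.
ParseContext recursingContext = new ParseContext();
recursingContext.set(Parser.class, parser);

InputStream stream = ZipParserTest.class.getResourceAsStream(
        "/test-documents/test-documents.zip");
try {
    parser.parse(stream, handler, metadata, recursingContext);
} finally {
    stream.close();
}
[/code]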

Hope it helps.
Let me know how it goes.

BR,
Oleg


On Mon, Jun 30, 2014 at 8:27 PM, yeshwanth kumar yeshwant...@gmail.com
wrote:

 Unable tp read zipfile using Apache Tika
 http://stackoverflow.com/q/24495504/1899893?sem=2



Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-27 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.6.

Tested on the following systems:
1. Microsoft Windows 7 Enterprise, SP 1, x64-based PC
2. Linux ubuntu 3.11.0-24-generic #42-Ubuntu SMP x86_64 GNU/Linux

Thanks,
Oleg



On Mon, Jul 28, 2014 at 7:22 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,

 A candidate for the Tika 1.6 release is available at:

 http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


 The release candidate is a zip archive of the sources in:

 http://svn.apache.org/repos/asf/tika/tags/1.6/

 The SHA1 checksum of the archive is
 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
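
 For anyone checking the archive by hand, a minimal sketch of a SHA-1
 comparison in Java follows. Sha1Check and its methods are hypothetical names
 for illustration; the demo hashes the ASCII string "abc" (a standard SHA-1
 test vector) rather than the release archive itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Check {
    // Returns the SHA-1 digest of the data as a lower-case hex string.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    // Compares the computed digest with the published checksum.
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) {
        // For a real check, read the downloaded archive instead, for example
        // with java.nio.file.Files.readAllBytes(...).
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sha1Hex(data));
        System.out.println(matches(data, "A9993E364706816ABA3E25717850C26C9CD0D89D"));
    }
}
```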

 A Maven staging repository is available at:

 https://repository.apache.org/content/repositories/orgapachetika-1003/


 Please vote on releasing this package as Apache Tika 1.6.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.6
 [ ] -1 Do not release this package because...

 Thank you!

 Cheers,
 Chris

 P.S. Here is my +1!








Re: [jira] [Created] (TIKA-1405) German content detected as French

2014-08-30 Thread Oleg Tikhonov
Hi,
Does the content contain only one language, or is it mixed?
If the text contains a single language, then something seems to be wrong in
our language profiles. If it is mixed, then it is kind of OK: the first
language detected will be the answer.

What is the size of the content? One word or a bunch of text? Basically,
language detection is more precise on a big text than on a small one.
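
To illustrate why the amount of text matters, here is a toy sketch of
profile-based detection using character trigrams and cosine similarity. It is
not Tika's implementation; TrigramLanguageGuesser, the tiny sample texts and
the lower-casing step are all assumptions made for the example. A longer
input yields more trigrams, so the comparison against each profile rests on
more evidence.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class TrigramLanguageGuesser {
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    // Counts character trigrams of the lower-cased, whitespace-normalised text.
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Builds a (very small) profile for a language from sample text.
    void addProfile(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText));
    }

    // Cosine similarity between two trigram count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the best-scoring language, or null if no profile shares a trigram.
    String guess(String text) {
        Map<String, Integer> input = trigrams(text);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(input, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TrigramLanguageGuesser guesser = new TrigramLanguageGuesser();
        guesser.addProfile("en",
                "the quick brown fox jumps over the lazy dog and the cat sleeps in the house");
        guesser.addProfile("de",
                "der schnelle braune fuchs springt und der hund und die katze schlafen im haus");
        System.out.println(guesser.guess("THE DOG AND THE CAT"));
        System.out.println(guesser.guess("der hund und die katze"));
    }
}
```

Note that a single best score is returned, so a document mixing two languages
still gets only one answer.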

Best regards,
Oleg


On Sat, Aug 30, 2014 at 1:13 PM, Zaheer Beig (JIRA) j...@apache.org wrote:

 Zaheer Beig created TIKA-1405:
 -

  Summary: German content detected as French
  Key: TIKA-1405
  URL: https://issues.apache.org/jira/browse/TIKA-1405
  Project: Tika
   Issue Type: Bug
   Components: languageidentifier
 Affects Versions: 1.4
  Environment: Linux
 Reporter: Zaheer Beig


 Hi,
 We are using Apache Tika 1.4 for document conversion to text and language
 detection in one of our projects. We are facing the issues below with
 language detection:

 1. When the text is in all UPPER CASE, even though the language is
 English, it gets detected as Estonian.
 2. For much of our German content, the language gets detected as French
 [though this is not the case for all German content].

 Any update on this will be very helpful.



 --
 This message was sent by Atlassian JIRA
 (v6.2#6252)



Re: 1.7 release?

2014-10-20 Thread Oleg Tikhonov
Hi, I can try this out.
What is trunk?


Thanks,
Oleg

On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hmm any idea why this is failing on Windows? Tyler P. and
 I were talking the other day - maybe we shouldn't run the
 tests from TIKA-1422 unless Tesseract is installed? Thoughts?
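
 One way to decide whether such tests should run is to probe for the external
 binary first. The sketch below is an assumption-laden illustration, not the
 check used in Tika: ExternalToolCheck is a hypothetical name, and it assumes
 the executable is called "tesseract" and is on the PATH.

```java
import java.io.IOException;

class ExternalToolCheck {
    // Returns true if the command could be launched, false if the operating
    // system could not start it (for example because it is not installed).
    // At least one command element must be given.
    static boolean isAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            process.destroy(); // we only care that it could be launched
            return true;
        } catch (IOException e) {
            return false; // typically "No such file or directory"
        }
    }

    public static void main(String[] args) {
        // Assumption: the binary is named "tesseract" and found via PATH.
        System.out.println("tesseract available: " + isAvailable("tesseract"));
    }
}
```

 In a JUnit 4 test, the result could then be fed to
 org.junit.Assume.assumeTrue(...) so that the test is skipped rather than
 failed when the tool is missing.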

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Hong-Thai Nguyen thaicha...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Thursday, October 16, 2014 at 2:03 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi Andrzej,
 
 We are eager for the 1.7 release too.
 I'm having a problem compiling TIKA-1422 on my side. If anyone can build
 successfully on Windows, I have no objection to releasing 1.7.
 
 Thanks,
 
 On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org wrote:
 
  Hi,
 
  Any news on the 1.7 release? or at least a 1.6.1 release that includes
 the
  fix for broken ODF parsing...
 
  ---
  Best regards,
 
  Andrzej Bialecki
 
 
 
 
 --
 --
 Hong-Thai




Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Taken. Thanks. In progress ...

On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Trunk is the current checkout/branch:

 http://svn.apache.org/repos/asf/tika/trunk


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Oleg Tikhonov olegtikho...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Monday, October 20, 2014 at 10:16 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi, I can try this on.
 What is a trunk?
 
 
 Thanks,
 Oleg
 
 On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  Hmm any idea why this is failing on Windows? Tyler P. and
  I were talking the other day - maybe we shouldn't run the
  tests from TIKA-1422 unless Tesseract is installed? Thoughts?
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
 
 
  -Original Message-
  From: Hong-Thai Nguyen thaicha...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Thursday, October 16, 2014 at 2:03 AM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Re: 1.7 release?
 
  Hi Andrzej,
  
  We are impatient for 1.7 release too.
  I'm having compiling problem of TIKA-1422 on me. If anyone can build
  successfully on Windows, I have no objection to release 1.7
  
  Thanks,
  
  On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org
 wrote:
  
   Hi,
  
   Any news on the 1.7 release? or at least a 1.6.1 release that
 includes
  the
   fix for broken ODF parsing...
  
   ---
   Best regards,
  
   Andrzej Bialecki
  
  
  
  
  --
  --
  Hong-Thai
 
 




Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Please give the newest patch a try.
Cheers,
Oleg

On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov olegtikho...@gmail.com
wrote:

 Taken. Thanks. in progress ...

 On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Trunk is the current checkout/branch:

 http://svn.apache.org/repos/asf/tika/trunk


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Oleg Tikhonov olegtikho...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Monday, October 20, 2014 at 10:16 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi, I can try this on.
 What is a trunk?
 
 
 Thanks,
 Oleg
 
 On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  Hmm any idea why this is failing on Windows? Tyler P. and
  I were talking the other day - maybe we shouldn't run the
  tests from TIKA-1422 unless Tesseract is installed? Thoughts?
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
 
 
  -Original Message-
  From: Hong-Thai Nguyen thaicha...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Thursday, October 16, 2014 at 2:03 AM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Re: 1.7 release?
 
  Hi Andrzej,
  
  We are impatient for 1.7 release too.
  I'm having compiling problem of TIKA-1422 on me. If anyone can build
  successfully on Windows, I have no objection to release 1.7
  
  Thanks,
  
  On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org
 wrote:
  
   Hi,
  
   Any news on the 1.7 release? or at least a 1.6.1 release that
 includes
  the
   fix for broken ODF parsing...
  
   ---
   Best regards,
  
   Andrzej Bialecki
  
  
  
  
  --
  --
  Hong-Thai
 
 





Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Sorry!!!

On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Thanks Oleg, will try tomorrow, Los Angeles time!

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Oleg Tikhonov o...@apache.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Monday, October 20, 2014 at 11:20 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Please take a try with newest patch.
 Cheers,
 Oleg
 
 On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov olegtikho...@gmail.com
 wrote:
 
  Taken. Thanks. in progress ...
 
  On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
  Trunk is the current checkout/branch:
 
  http://svn.apache.org/repos/asf/tika/trunk
 
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
 
 
  -Original Message-
  From: Oleg Tikhonov olegtikho...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Monday, October 20, 2014 at 10:16 PM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Re: 1.7 release?
 
  Hi, I can try this on.
  What is a trunk?
  
  
  Thanks,
  Oleg
  
  On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
  
   Hmm any idea why this is failing on Windows? Tyler P. and
   I were talking the other day - maybe we shouldn't run the
   tests from TIKA-1422 unless Tesseract is installed? Thoughts?
  
   ++
   Chris Mattmann, Ph.D.
   Chief Architect
   Instrument Software and Science Data Systems Section (398)
   NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
   Office: 168-519, Mailstop: 168-527
   Email: chris.a.mattm...@nasa.gov
   WWW:  http://sunset.usc.edu/~mattmann/
   ++
   Adjunct Associate Professor, Computer Science Department
   University of Southern California, Los Angeles, CA 90089 USA
   ++
  
  
  
  
  
  
   -Original Message-
   From: Hong-Thai Nguyen thaicha...@gmail.com
   Reply-To: dev@tika.apache.org dev@tika.apache.org
   Date: Thursday, October 16, 2014 at 2:03 AM
   To: dev@tika.apache.org dev@tika.apache.org
   Subject: Re: 1.7 release?
  
   Hi Andrzej,
   
   We are impatient for 1.7 release too.
   I'm having compiling problem of TIKA-1422 on me. If anyone can
 build
   successfully on Windows, I have no objection to release 1.7
   
   Thanks,
   
   On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org
  wrote:
   
Hi,
   
Any news on the 1.7 release? or at least a 1.6.1 release that
  includes
   the
fix for broken ODF parsing...
   
---
Best regards,
   
Andrzej Bialecki
   
   
   
   
   --
   --
   Hong-Thai
  
  
 
 
 




Re: 1.7 release?

2014-10-24 Thread Oleg Tikhonov
Hi Tyler,
don't mention it.

Cheers,
Oleg
On Oct 24, 2014 8:02 PM, Tyler Palsulich tpalsul...@gmail.com wrote:

 Thank you for the help, Oleg! I just resolved TIKA-1422. So, are there any
 other issues anyone would like to resolve before a new release?

 Thanks,
 Tyler

 On Tue, Oct 21, 2014 at 2:42 AM, Oleg Tikhonov olegtikho...@gmail.com
 wrote:

  Sorry!!!
 
  On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
   Thanks Oleg, will try tomorrow for me Los angeles time!
  
   ++
   Chris Mattmann, Ph.D.
   Chief Architect
   Instrument Software and Science Data Systems Section (398)
   NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
   Office: 168-519, Mailstop: 168-527
   Email: chris.a.mattm...@nasa.gov
   WWW:  http://sunset.usc.edu/~mattmann/
   ++
   Adjunct Associate Professor, Computer Science Department
   University of Southern California, Los Angeles, CA 90089 USA
   ++
  
  
  
  
  
  
   -Original Message-
   From: Oleg Tikhonov o...@apache.org
   Reply-To: dev@tika.apache.org dev@tika.apache.org
   Date: Monday, October 20, 2014 at 11:20 PM
   To: dev@tika.apache.org dev@tika.apache.org
   Subject: Re: 1.7 release?
  
   Please take a try with newest patch.
   Cheers,
   Oleg
   
   On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov 
 olegtikho...@gmail.com
   wrote:
   
Taken. Thanks. in progress ...
   
On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:
   
Trunk is the current checkout/branch:
   
http://svn.apache.org/repos/asf/tika/trunk
   
   
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
   
   
   
   
   
   
-Original Message-
From: Oleg Tikhonov olegtikho...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, October 20, 2014 at 10:16 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: 1.7 release?
   
Hi, I can try this on.
What is a trunk?


Thanks,
Oleg

On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hmm any idea why this is failing on Windows? Tyler P. and
 I were talking the other day - maybe we shouldn't run the
 tests from TIKA-1422 unless Tesseract is installed? Thoughts?


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/

 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA

 ++






 -Original Message-
 From: Hong-Thai Nguyen thaicha...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Thursday, October 16, 2014 at 2:03 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi Andrzej,
 
 We are impatient for 1.7 release too.
 I'm having compiling problem of TIKA-1422 on me. If anyone can
   build
 successfully on Windows, I have no objection to release 1.7
 
 Thanks,
 
 On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki 
  a...@getopt.org
wrote:
 
  Hi,
 
  Any news on the 1.7 release? or at least a 1.6.1 release that
includes
 the
  fix for broken ODF parsing...
 
  ---
  Best regards,
 
  Andrzej Bialecki
 
 
 
 
 --
 --
 Hong-Thai


   
   
   
  
  
 



Re: [jira] [Created] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-02-06 Thread Oleg Tikhonov
Hi,
Just one guess. Did you check the permissions? Does it have executable
permission?
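
A quick way to check this from Java, sketched with the standard
java.nio.file API. ExecutableCheck is a hypothetical helper, and the default
path is just the one from the report. Keep in mind that a file under a
directory the current user cannot enter (such as /root for a non-root user)
will look missing.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ExecutableCheck {
    // Describes whether the given path can be used as an executable
    // by the current user.
    static String describe(String file) {
        Path path = Paths.get(file);
        if (!Files.exists(path)) {
            return "missing or inaccessible";
        }
        if (Files.isDirectory(path)) {
            return "directory";
        }
        return Files.isExecutable(path) ? "executable" : "not executable";
    }

    public static void main(String[] args) {
        // Default taken from the issue report; pass another path as argument.
        String target = args.length > 0 ? args[0] : "/root/tesseract";
        System.out.println(target + ": " + describe(target));
    }
}
```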

Br,
Oleg
On 6 Feb 2015 12:15, Sean Zhao (JIRA) j...@apache.org wrote:

 Sean Zhao created TIKA-1543:
 ---

  Summary: TesseractOCRParser.setTesseractPath() doesn't work
 on Linux
  Key: TIKA-1543
  URL: https://issues.apache.org/jira/browse/TIKA-1543
  Project: Tika
   Issue Type: Bug
   Components: parser
 Affects Versions: 1.7
 Reporter: Sean Zhao
  Fix For: 1.7


 After calling setTesseractPath() to set the Tesseract path to a non-default
 path, such as /root/tesseract, calling TesseractOCRParser.parse() returns
 nothing.
 Not sure if this is related to TIKA-1421.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-29 Thread Oleg Tikhonov
+1 for 1.8 release.
On 29 Mar 2015 02:04, Konstantin Gribov gros...@gmail.com wrote:

 Also, I think, we should resolve TIKA-1575 (upgrade to pdfbox 1.8.9) since
 pdfbox 1.8.8 hangs on some pdf forms.

 --
 Best regards,
 Konstantin Gribov

 сб, 28 марта 2015 г. в 23:22, Konstantin Gribov gros...@gmail.com:

  +1 to releasing 1.8.
 
  --
  Best regards,
  Konstantin Gribov
 
  сб, 28 марта 2015, 22:25, Tyler Palsulich tpalsul...@apache.org:
 
  I'm also leaning toward 1.8. Especially given the newly identified
  regression in TIKA-1584.
 
  Tyler
  On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
   Hi Tyler - I would VOTE for 1.8. Given the stuff associated
   with releasing (updating the website; sending emails; waiting
   periods, etc.) let’s ship all the updates we have too along
   with the jhighlight fix.
  
   Cheers,
   Chris
  
   ++
   Chris Mattmann, Ph.D.
   Chief Architect
   Instrument Software and Science Data Systems Section (398)
   NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
   Office: 168-519, Mailstop: 168-527
   Email: chris.a.mattm...@nasa.gov
   WWW:  http://sunset.usc.edu/~mattmann/
   ++
   Adjunct Associate Professor, Computer Science Department
   University of Southern California, Los Angeles, CA 90089 USA
   ++
  
  
  
  
  
  
   -Original Message-
   From: Tyler Palsulich tpalsul...@apache.org
   Reply-To: dev@tika.apache.org dev@tika.apache.org
   Date: Saturday, March 28, 2015 at 8:01 AM
   To: dev@tika.apache.org dev@tika.apache.org
   Subject: [DISCUSS] Tika 1.8 or 1.7.1
  
   Hi Folks,
   
   Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need
  to
   release a new version of Tika. I'll volunteer to be the release
 manager
   again.
   
   Should we release this as 1.8 or 1.7.1?
   
   Does anyone have any last minute issues they'd like to finish and see
  in
   Tika 1.X? I'd like to get the example working with CORS (TIKA-1585
 and
   TIKA-1586). Any others?
   
   Have a good weekend,
   Tyler
  
  
 
 



Re: [jira] [Closed] (TIKA-993) Language Detection Fault

2015-03-02 Thread Oleg Tikhonov
Hi,
Just for the record ...
It can happen if a file contains content written in at least two
different languages. For instance, the first half of the file is, say, in
German and the second half in French. In such a case detection would be
faulty.

Br,
Oleg
On 3 Mar 2015 04:03, Tyler Palsulich (JIRA) j...@apache.org wrote:


  [
 https://issues.apache.org/jira/browse/TIKA-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

 Tyler Palsulich closed TIKA-993.
 
 Resolution: Cannot Reproduce

 This issue is 2 years old and has no attachment for the text. So, I'm
 closing as Cannot Reproduce. If you still have the text, please reopen!

  Language Detection Fault
  
 
  Key: TIKA-993
  URL: https://issues.apache.org/jira/browse/TIKA-993
  Project: Tika
   Issue Type: Bug
   Components: languageidentifier
 Reporter: Iman Reihanian
  Attachments: DetectorImpl.java
 
 
  This text's language is English but it is detected as Italian.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Re: [jira] [Closed] (TIKA-993) Language Detection Fault

2015-03-03 Thread Oleg Tikhonov
The first one found; in this case it will be German. The expected result is
a topic for discussion. I would expect to get both detected languages.
However, that is beyond Tika's language detection.

Bottom line: let it stay as is until Ken's implementation.
On 3 Mar 2015 09:09, Tyler Palsulich tpalsul...@gmail.com wrote:

 Hi,

 What do you mean, the detection is faulty? What is the expected result in
 that case?

 Thanks,
 Tyler
 On Mar 3, 2015 1:10 AM, Oleg Tikhonov o...@apache.org wrote:

  Hi,
  Just for the record ...
  It can happen if a file contains context that at least written in two
  different languages. For instance, the first half of file, say, is a
 German
  and the second one, say ... a French. In such case detection would be
  faulty.
 
  Br,
  Oleg
  On 3 Mar 2015 04:03, Tyler Palsulich (JIRA) j...@apache.org wrote:
 
  
[
  
 
 https://issues.apache.org/jira/browse/TIKA-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
   ]
  
   Tyler Palsulich closed TIKA-993.
   
   Resolution: Cannot Reproduce
  
   This issue is 2 years old and has no attachment for the text. So, I'm
   closing as Cannot Reproduce. If you still have the text, please reopen!
  
Language Detection Fault

   
Key: TIKA-993
URL: https://issues.apache.org/jira/browse/TIKA-993
Project: Tika
 Issue Type: Bug
 Components: languageidentifier
   Reporter: Iman Reihanian
Attachments: DetectorImpl.java
   
   
This text's language is English but it detects as Italy.
  
  
  
   --
   This message was sent by Atlassian JIRA
   (v6.3.4#6332)
  
 



Re: trunk test failure

2015-03-26 Thread Oleg Tikhonov
Hi Chris,
just to confirm:

[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ..................... SUCCESS [  9.268 s]
[INFO] Apache Tika core ....................... SUCCESS [ 25.823 s]
[INFO] Apache Tika parsers .................... SUCCESS [02:41 min]
[INFO] Apache Tika XMP ........................ SUCCESS [  1.986 s]
[INFO] Apache Tika serialization .............. SUCCESS [  1.604 s]
[INFO] Apache Tika batch ...................... SUCCESS [02:02 min]
[INFO] Apache Tika application ................ SUCCESS [ 18.983 s]
[INFO] Apache Tika OSGi bundle ................ SUCCESS [ 29.087 s]
[INFO] Apache Tika server ..................... SUCCESS [ 46.706 s]
[INFO] Apache Tika translate .................. SUCCESS [  9.163 s]
[INFO] Apache Tika examples ................... SUCCESS [  4.134 s]
[INFO] Apache Tika Java-7 Components .......... SUCCESS [  1.236 s]
[INFO] Apache Tika ............................ SUCCESS [  0.017 s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 07:20 min
[INFO] Finished at: 2015-03-26T09:18:46+02:00
[INFO] Final Memory: 91M/848M
[INFO]


BR,
Oleg

On Thu, Mar 26, 2015 at 1:21 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 OK I am nuts - I was applying the patch from TIKA-1580, but didn’t
 update Felix in the bundle pom - done now, building again. Yay.


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Mattmann, Chris Mattmann chris.a.mattm...@jpl.nasa.gov
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Wednesday, March 25, 2015 at 6:57 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: trunk test failure

 Hey Anyone else seeing this failure in trunk?
 
 Running org.apache.tika.bundle.BundleIT
 [main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System
 (Version: 4.4.0) created.
 [main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - creating PaxExam
 runner for class org.apache.tika.bundle.BundleIT
 [main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - running test class
 org.apache.tika.bundle.BundleIT
 ERROR: Bundle org.apache.tika.bundle [17] Error starting
 file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar
 (org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement
 [17.0] osgi.wiring.package;
 ((osgi.wiring.package=org.apache.commons.csv)(version=1.0.0)(!(version=2.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement
 [17.0] osgi.wiring.package;
 ((osgi.wiring.package=org.apache.commons.csv)(version=1.0.0)(!(version=2.0.0)))
     at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:4097)
     at org.apache.felix.framework.Felix.startBundle(Felix.java:2114)
     at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1368)
     at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:308)
     at java.lang.Thread.run(Thread.java:745)
 [main] ERROR org.ops4j.pax.exam.nat.internal.NativeTestContainer - Bundle
 [org.apache.tika.bundle [17]] is not resolved
 ERROR: Bundle org.apache.tika.bundle [17] Error starting
 file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar
 (org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement
 [17.0] osgi.wiring.package;
 ((osgi.wiring.package=org.apache.commons.csv)(version=1.0.0)(!(version=2.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement
 [17.0] osgi.wiring.package;
 ((osgi.wiring.package=org.apache.commons.csv)(version=1.0.0)(!(version=2.0.0)))
     at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:4097)
at 

Re: TIKA-1423 Build a parser to extract data from GRIB formats not good with Java 6

2015-01-30 Thread Oleg Tikhonov
Hi there,
+1 for dropping.
 On 30 Jan 2015 05:05, Tyler Palsulich tpalsul...@gmail.com wrote:

 +1

 Tyler
 On Jan 29, 2015 9:52 PM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

  +1 move to 1.7
 
  Sent from my iPhone
 
   On Jan 29, 2015, at 5:04 PM, Allison, Timothy B. talli...@mitre.org
  wrote:
  
   +1 to dropping 1.6...let's move to 1.8 and beyond! :)
  
   -Original Message-
   From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
   Sent: Thursday, January 29, 2015 6:51 PM
   To: dev@tika.apache.org
   Subject: TIKA-1423 Build a parser to extract data from GRIB formats not
  good with Java 6
  
   Hi Folks,
   Having committed TIKA-1423 it has become apparent to me that the libraries
   being pulled as dependencies are not compatible with JDK 1.6 as indicated
   with our Jenkins 1.6 build.
  
   Do we want to move towards dropping support for Java 1.6? Oracle made an
   announcement some time ago so this is not exactly new news
  
   https://blogs.oracle.com/henrik/entry/java_6_eol_h_h
  
  
   [ERROR] Failed to execute goal de.thetaphi:forbiddenapis:1.7:check
   (default) on project tika-parsers: Check for forbidden API calls
   failed: java.lang.ClassNotFoundException: Class
   'java.lang.AutoCloseable' not found on classpath - [Help 1]
   [ERROR]
   [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
   [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   [ERROR]
   [ERROR] For more information about the errors and possible solutions, please read the following articles:
   [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
   [ERROR]
   [ERROR] After correcting the problems, you can resume the build with the command
   [ERROR]   mvn <goals> -rf :tika-parsers
  
  
  
   --
   *Lewis*
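
The ClassNotFoundException in the build log above is what one would expect on a Java 6 runtime, since java.lang.AutoCloseable only exists from Java 7 onward. As a rough illustration (not part of Tika or of the forbiddenapis plugin), a probe like the one below reports whether a class is visible to the running JVM. The class name RuntimeClassProbe and its method are hypothetical.

```java
public class RuntimeClassProbe {

    /**
     * Returns true if the named class can be located by this class's
     * class loader. The class is looked up without being initialized.
     */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, RuntimeClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On a Java 6 runtime this would report "false"; on Java 7+ it reports "true".
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("AutoCloseable present: " + isPresent("java.lang.AutoCloseable"));
    }
}
```

A check like this only tells you about the JVM it runs on; it does not replace building against the intended target JDK.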
 



Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Oleg Tikhonov
Hi Tim,
Having looked at CC, a couple of ideas crossed my mind. I think it's cool.
+1.

BR,
Oleg
On 3 Apr 2015 17:29, Allison, Timothy B. talli...@mitre.org wrote:

 All,

 What do you think?


 https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0


 On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
 CommonCrawl currently has the WET format that extracts plain text from web
 pages.  My guess is that this is text stripping from text-y formats.  Let
 me know if I'm wrong!

 Would there be any interest in adding another format: WETT (WET-Tika) or
 supplementing the current WET by using Tika to extract contents from binary
 formats too: PDF, MSWord, etc.

 Julien Nioche kindly carved out 220 GB for us to experiment with on
 TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace
 vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
 run Tika as part of its regular process and make the output available in
 one of your standard formats.

 CommonCrawl consumers would get Tika output, and the Tika dev community
 (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
 to help prioritize bug fixes.

 Cheers,

   Tim



Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-15 Thread Oleg Tikhonov
Hi Tyler,

good job, indeed !!!

[x] +1 Release this package as Apache Tika 1.8

On Wed, Apr 15, 2015 at 8:22 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Thanks Tyler! +1 from me:

 SIGS, checksums check out:


 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/stage_apache_rc
 tika 1.8-src https://dist.apache.org/repos/dist/dev/tika/

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100 69.2M  100 69.2M    0     0  1524k      0  0:00:46  0:00:46 --:--:-- 1661k

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100   473  100   473    0     0    874      0 --:--:-- --:--:-- --:--:--   874

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100    33  100    33    0     0     62      0 --:--:-- --:--:-- --:--:--    62

 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/stage_apache_rc
 tika-app 1.8 https://dist.apache.org/repos/dist/dev/tika/

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100 44.0M  100 44.0M    0     0  1742k      0  0:00:25  0:00:25 --:--:-- 1825k

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100   473  100   473    0     0    922      0 --:--:-- --:--:-- --:--:--   922

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100    33  100    33    0     0     63      0 --:--:-- --:--:-- --:--:--    63

 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/stage_apache_rc
 tika-server 1.8 https://dist.apache.org/repos/dist/dev/tika/

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100 48.3M  100 48.3M    0     0  1379k      0  0:00:35  0:00:35 --:--:-- 1569k

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100   473  100   473    0     0    891      0 --:--:-- --:--:-- --:--:--   892

   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100    33  100    33    0     0     62      0 --:--:-- --:--:-- --:--:--    62

 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%


 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/verify_gpg_sigs

 Verifying Signature for file tika-1.8-src.zip.asc

 gpg: Signature made Mon Apr 13 13:46:39 2015 EDT using RSA key ID D4F10117

 gpg: Good signature from Tyler Palsulich tpalsul...@apache.org

 gpg: WARNING: This key is not certified with a trusted signature!

 gpg:  There is no indication that the signature belongs to the
 owner.

 Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117

 Verifying Signature for file tika-app-1.8.jar.asc

 gpg: Signature made Mon Apr 13 13:43:13 2015 EDT using RSA key ID D4F10117

 gpg: Good signature from Tyler Palsulich tpalsul...@apache.org

 gpg: WARNING: This key is not certified with a trusted signature!

 gpg:  There is no indication that the signature belongs to the
 owner.

 Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117

 Verifying Signature for file tika-server-1.8.jar.asc

 gpg: Signature made Mon Apr 13 13:45:00 2015 EDT using RSA key ID D4F10117

 gpg: Good signature from Tyler Palsulich tpalsul...@apache.org

 gpg: WARNING: This key is not certified with a trusted signature!

 gpg:  There is no indication that the signature belongs to the
 owner.

 Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117

 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%
 $HOME/bin/verify_md5_checksums

 md5sum: stat '*.tar.gz': No such file or directory

 md5sum: stat '*.bz2': No such file or directory

 md5sum: stat '*.tgz': No such file or directory

 tika-1.8-src.zip: OK

 [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%

 Cheers!

 Chris

 
 From: Tyler Palsulich [tpalsul...@apache.org]
 Sent: Monday, April 13, 2015 10:56 AM
 To: dev@tika.apache.org; u...@tika.apache.org
 Subject: [VOTE] Apache Tika 1.8 Release Candidate #2

 Hi Folks,

 A candidate for the Tika 1.8 release is available at:
   https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/

 

Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-08 Thread Oleg Tikhonov
Hi,
[x] +1 Release this package as Apache Tika 1.8.

Tested on: Ubuntu 14.10, x86_64. Java 1.7 (Oracle)
Don't we want to update the following dependencies:
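biz.aQute:bndlib ..................................... 1.43.0 -> 2.0.0.20130123-133441
org.apache.felix:org.apache.felix.scr.annotations .... 1.6.0 -> 1.9.10
org.osgi:org.osgi.compendium ......................... 4.0.0 -> 5.0.0
org.osgi:org.osgi.core ............................... 4.0.0 -> 6.0.0
com.drewnoakes:metadata-extractor .................... 2.7.2 -> 2.8.0
com.google.guava:guava ............................... 10.0.1 -> 18.0
edu.ucar:grib ........................................ 4.5.5 -> 8.0.29
org.ow2.asm:asm-debug-all ............................ 4.1 -> 5.0.3
commons-io:commons-io ................................ 2.1 -> 2.4
javax.mail:mail ...................................... 1.4.4 -> 1.5.0-b01
org.apache.cxf:cxf-rt-frontend-jaxrs ................. 2.7.8 -> 3.0.4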

BR,
Oleg




On Wed, Apr 8, 2015 at 2:55 AM, Tyler Palsulich tpalsul...@apache.org
wrote:

 CC'ing user@tika for visibility.

 Tyler

 On Tue, Apr 7, 2015 at 4:54 PM, Tyler Palsulich tpalsul...@apache.org
 wrote:

  Hi Folks,
 
  A candidate for the Tika 1.8 release is available at:
https://dist.apache.org/repos/dist/dev/tika/
 
  The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.8-rc1/
 
  The SHA1 checksum of the archive is
ddeb3b43ca1c1ef346658a7005434019507e096f.
 
  In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1008
 
  Please vote on releasing this package as Apache Tika 1.8.
  The vote is open for the next 72 hours and passes if a majority of at
  least three +1 Tika PMC votes are cast.
 
  [ ] +1 Release this package as Apache Tika 1.8
  [ ] -1 Do not release this package because...
 
  Have a good night!
  Tyler
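
The SHA1 checksum quoted in the vote mail can be checked with a few lines of Java. This is a minimal sketch and not the script used by the release manager. The class name Sha1Check and the command-line arguments (archive path, expected hex digest) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    /** Streams the input through SHA-1 and returns the digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java Sha1Check <archive> <expected-sha1-hex>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            // Compare case-insensitively, since published digests vary in case.
            System.out.println(actual.equalsIgnoreCase(args[1].trim()) ? "OK" : "MISMATCH: " + actual);
        }
    }
}
```

Usage would be along the lines of: java Sha1Check tika-1.8-src.zip followed by the digest from the vote mail. A matching checksum only shows the download is intact; the GPG signature still needs to be verified separately.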
 



Re: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Oleg Tikhonov
Hi,
All basic tests passed.
java version 1.7.0_75
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

Linux/Ubuntu x86_64
Superb !!!
[x] +1 Release this package as Apache Tika 1.9

Thanks,
Oleg

On Tue, Jun 9, 2015 at 2:12 PM, Sergey Beryozkin sberyoz...@gmail.com
wrote:

 +1

 Cheers, Sergey



 On Mon, Jun 8, 2015 at 1:11 PM Allison, Timothy B. talli...@mitre.org
 wrote:

  +1

 Built in Windows and Linux.  Works on problems (that I caused!) in rc1.

 Let's make sure to include last Java 1.6 version in the release notes,
 if that's what we've decided.

 Thank you, Chris!

 Best,

 Tim


 -Original Message-
 From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
 Sent: Saturday, June 06, 2015 9:47 PM
 To: dev@tika.apache.org
 Cc: u...@tika.apache.org
 Subject: [VOTE] Release Apache Tika 1.9 Candidate #2

 Hi Folks,

 A second candidate for the Tika 1.9 release is available at:

https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/

 The SHA1 checksum of the archive is
 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c.

 In addition, a staged maven repository is available here:
 https://repository.apache.org/content/repositories/orgapachetika-1011/


 Please vote on releasing this package as Apache Tika 1.9.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.9
 [ ] -1 Do not release this package because…

 Cheers,
 Chris

 P.S. Of course here is my +1.


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++









 --
 Sergey Beryozkin

 Talend Community Coders
 http://coders.talend.com/

 Blog: http://sberyozkin.blogspot.com



Re: Apache Tika: In use at Goldman Sachs

2015-08-20 Thread Oleg Tikhonov
Wow !!! Amazing.
How does it perform?

BR,
Oleg

On Thu, Aug 20, 2015 at 9:48 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Just saw this online:

 http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778


 Apache Tika is a BIG part of this!

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






Re: release Tika 1.10?

2015-08-04 Thread Oleg Tikhonov
Thanks!
+1

BR,
Oleg

On Tue, Aug 4, 2015 at 5:37 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 +1
 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++





 -Original Message-
 From: Allison, Timothy B. talli...@mitre.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Tuesday, July 28, 2015 at 11:08 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: RE: release Tika 1.10?

 Just finished the run against ~2.8 million docs (4.8 million including
 attachments) from a combination of govdocs1 and Common Crawl.  I compared
 1.9 with trunk.
 
 Most looks good.
 
 Some highlights:
 * Thanks to Andrew Jackson and TIKA-1678, we're now getting better
 metadata out of ~1300 from 550k PDFs. This appears to be far more common
 in Common Crawl PDFs than in govdocs1 PDFs.
 * No significant changes found in the handful of msg files...I wanted to
 check after the work on TIKA-1238.
 * Thanks to Andreas Beeker and TIKA-1046/POI 54332, there are far fewer
 PPT exceptions
 * There are a very few more files in CommonCrawl that are now incorrectly
 identified as RFC vs text (TIKA-1602), but this is a tiny handful (total
 of 4 documents in both CC and govdocs1)
 
 A regret:
 This run used the digesting parser for both container and embedded files.
  This causes some truncated (=corrupt) package files to throw an
 exception before they otherwise would.  The opposite happens, too (more
 embedded files when using the digester), but this is extremely rare. This
 means that for truncated gz, x-xz and x-archive files there are many more
 with fewer attachments in Tika 1.10-SNAPSHOT than in Tika 1.9.
 
 With Konstantin's and Bob's fix of TIKA-1524, I think we're in good shape
 for 1.10...from my perspective.
 
  Best,
 
Tim
 -Original Message-
 From: David Meikle [mailto:loo...@gmail.com]
 Sent: Sunday, July 26, 2015 10:50 AM
 To: dev@tika.apache.org
 Subject: Re: release Tika 1.10?
 
 
  On 23 Jul 2015, at 14:07, Allison, Timothy B. talli...@mitre.org
 wrote:
 
   With the fix of TIKA-1690, I think it makes sense to roll a new
 release (1.10) in the next week or so.  I'd like to get TIKA-1667
 (upgrade poi) in before the release.  Are there any other blockers on
 1.10?
 
 +1 from me too.  As discussed on private, I will roll the release on
 Tuesday night (UK Time) to give people time to shout for other candidates.
 
 Cheers,
 Dave




Re: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-04 Thread Oleg Tikhonov
Hi, thanks for doing that !!!
+1 for the release.
Ran on Kubuntu 15 x64. All basic tests passed.

BR,
Oleg

On Tue, Aug 4, 2015 at 6:17 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 +1 from me, great work Dave! SIGS and CHECKSUMS are sound:

 [chipotle:~/tmp/tika-1.10-rc1] mattmann% /bin/bash
 bash-3.2$ for type in app server; do
  for version in 1.10 1.10-src; do
  /Users/mattmann/bin/stage_apache_rc tika-$type $version
 https://dist.apache.org/repos/dist/dev/tika/
  done
  done
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100 45.0M  100 45.0M    0     0  1481k      0  0:00:31  0:00:31 --:--:-- 1937k
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100   819  100   819    0     0   2057      0 --:--:-- --:--:-- --:--:--  2062
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100    33  100    33    0     0     80      0 --:--:-- --:--:-- --:--:--    80
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100 50.5M  100 50.5M    0     0  1586k      0  0:00:32  0:00:32 --:--:-- 2134k
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100   819  100   819    0     0   1910      0 --:--:-- --:--:-- --:--:--  1913
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100    33  100    33    0     0     78      0 --:--:-- --:--:-- --:--:--    78
 bash-3.2$ ls
 tika-app-1.10.jar tika-app-1.10.jar.asc tika-app-1.10.jar.md5
tika-server-1.10.jar  tika-server-1.10.jar.asc
 tika-server-1.10.jar.md5
 bash-3.2$ $HOME/bin/stage_apache_rc tika 1.10-src
 https://dist.apache.org/repos/dist/dev/tika/
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100 73.6M  100 73.6M    0     0  2044k      0  0:00:36  0:00:36 --:--:-- 2700k
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100   819  100   819    0     0   1950      0 --:--:-- --:--:-- --:--:--  1950
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 100    33  100    33    0     0     77      0 --:--:-- --:--:-- --:--:--    78
 bash-3.2$ ls
 tika-1.10-src.zip tika-1.10-src.zip.md5 tika-app-1.10.jar.asc
tika-server-1.10.jar  tika-server-1.10.jar.md5
 tika-1.10-src.zip.asc tika-app-1.10.jar tika-app-1.10.jar.md5
tika-server-1.10.jar.asc
 bash-3.2$ exit
 exit
 [chipotle:~/tmp/tika-1.10-rc1] mattmann% $HOME/bin/verify_gpg_sigs
 Verifying Signature for file tika-1.10-src.zip.asc
 gpg: Signature made Sat Aug  1 23:34:31 2015 PDT using RSA key ID 0EB30B07
 gpg: Good signature from David Meikle (CODE SIGNING KEY)
 dmei...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
 Verifying Signature for file tika-app-1.10.jar.asc
 gpg: Signature made Sat Aug  1 23:24:15 2015 PDT using RSA key ID 0EB30B07
 gpg: Good signature from David Meikle (CODE SIGNING KEY)
 dmei...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
 Verifying Signature for file tika-server-1.10.jar.asc
 gpg: Signature made Sat Aug  1 23:30:05 2015 PDT using RSA key ID 0EB30B07
 gpg: Good signature from David Meikle (CODE SIGNING KEY)
 dmei...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
 [chipotle:~/tmp/tika-1.10-rc1] mattmann% $HOME/bin/verify_md5_checksums
 md5sum: stat '*.tar.gz': No such file or directory
 md5sum: stat '*.bz2': No such file or directory
 md5sum: stat '*.tgz': No such file or directory
 tika-1.10-src.zip: OK
 [chipotle:~/tmp/tika-1.10-rc1] mattmann%





 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and 

Re: Bayesian N-Gram Language Detection

2015-07-29 Thread Oleg Tikhonov
+1 !!!
My two cents.
Please also add the ability to change/retrain/tote language profiles.

Thanks !!!
BR,
Oleg

On Wed, Jul 29, 2015 at 3:59 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Cool. Well with this one I found, along with language-detector,
 along with Ramirez and the work with Joe Campbell’s group at MIT-LL
 and the Julia stuff, I for one am going to take the step to make it
 pluggable.

 I’ll try and take this on over the next week. I’ll use a ServiceLoader
 approach similar to Translators, Detectors, Parsers, etc.

 Cheers,
 Chris
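
The ServiceLoader approach mentioned above can be sketched roughly as follows. This is only an illustration of the general pattern: the interface LanguageDetectorPlugin and the fallback class are hypothetical names, not Tika's actual API. In a real setup, each provider would be listed in a META-INF/services file named after the interface's binary name.

```java
import java.util.ServiceLoader;

public class PluggableLanguageDetectorSketch {

    /** Hypothetical service interface; implementations are discovered at runtime. */
    public interface LanguageDetectorPlugin {
        /** Returns a language code for the given text, or "unknown". */
        String detect(String text);
    }

    /** Trivial fallback used when no provider is registered on the classpath. */
    static final class UnknownDetector implements LanguageDetectorPlugin {
        @Override
        public String detect(String text) {
            return "unknown";
        }
    }

    /** Returns the first registered provider, or the fallback if none is found. */
    static LanguageDetectorPlugin loadDetector() {
        for (LanguageDetectorPlugin plugin : ServiceLoader.load(LanguageDetectorPlugin.class)) {
            return plugin;
        }
        return new UnknownDetector();
    }

    public static void main(String[] args) {
        LanguageDetectorPlugin detector = loadDetector();
        System.out.println(detector.getClass().getSimpleName() + " -> " + detector.detect("Hello world"));
    }
}
```

With no provider registered, the sketch falls back to UnknownDetector; dropping a jar with a provider and its services file onto the classpath would switch the implementation without code changes, which is the point of making the detector pluggable.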

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++





 -Original Message-
 From: Ken Krugler kkrugler_li...@transpac.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Tuesday, July 28, 2015 at 5:39 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: RE: Bayesian N-Gram Language Detection

 I think switching to language-detector is a reasonable first step (more
 languages, faster, better accuracy), after which we can evaluate the need
 to make it pluggable.
 
 There were some code & resource packaging issues with the original
 project, but the fork I've been trying out seems much better.
 
 See https://github.com/optimaize/language-detector
 
 Still ALv2, and already in the Maven central repo.
 
 -- Ken
 
  From: Mattmann, Chris A (3980)
  Sent: July 28, 2015 5:30:00pm PDT
  To: dev@tika.apache.org
  Subject: Bayesian N-Gram Language Detection
 
  FYI the code is ALv2:
 
  https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
 
 
  I’m going to test this out and see how it compares with our own.
  Maybe we need to make the Language Detector pluggable too.
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr
 
 
 
 
 




Re: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-25 Thread Oleg Tikhonov
Hi guys, everything looks fine on a basic setup on x86_64 Ubuntu; however, I got
the following:
Running org.apache.tika.parser.journal.JournalParserTest
25 Oct 2015 10:45:53  WARN PhaseInterceptorChain - Interceptor for {http://localhost:8080/grobid}WebClient has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.jaxrs.client.AbstractClient.doRunInterceptorChain(AbstractClient.java:623)
at org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1084)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:883)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:854)
at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:320)
at org.apache.cxf.jaxrs.client.WebClient.get(WebClient.java:346)
at org.apache.tika.parser.journal.GrobidRESTParser.canRun(GrobidRESTParser.java:102)
at org.apache.tika.parser.journal.JournalParserTest.testJournalParser(JournalParserTest.java:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: java.net.ConnectException: ConnectException invoking http://localhost:8080/grobid: Connection refused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.mapException(HTTPConduit.java:1359)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1343)
at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
at org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:638)
at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
... 33 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at 

Re: Remove support for building language identifier profiles?

2015-08-30 Thread Oleg Tikhonov
Hi Ken,
I would choose the last option you mentioned.

-- Oleg

On Sat, Aug 29, 2015 at 7:58 PM, Ken Krugler kkrugler_li...@transpac.com
wrote:

 Hi all,

 As part of integrating language-detector into Tika (see TIKA-1723), I
 noticed TIKA-546 (Add ability to create language profiles to tika-app)

 If we switch over to language-detector, then this code no longer makes
 sense.

 Also note that many language detectors require the full set of language
 data in order to generate the most relevant (discriminating) ngrams, thus
 the current support for passing in data for one language doesn't work.

 So any suggestions for what to do? Leave the code as is, with deprecated
 annotations, even though the profiles generated won't be useful?

 Or wait for pluggable detectors, and someone could port the current Tika
 code - then this profile building support might still make sense, though it
 would want to be moved into the specific plugin.

 -- Ken





Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member

2015-09-17 Thread Oleg Tikhonov
Good intro. Welcome aboard.
Oleg
On 17 Sep 2015 03:05, "David Meikle"  wrote:

> Hello All,
>
> Please welcome Bob Paulin as he joins us as the latest Tika committer and
> PMC Member.
>
> Bob, please feel free to say a bit about yourself as an introduction to
> the group.
>
> Welcome aboard,
> Dave
>
>
>
>
>


Re: [DISCUSS] Moving to Git

2015-11-19 Thread Oleg Tikhonov
+1.
There are a bunch of add-ons, for instance git flow.


On Wed, Nov 18, 2015 at 7:15 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Nick,
>
> Git has something similar to svn:externals:
>
> http://stackoverflow.com/questions/571232/svnexternals-equivalent-in-git
>
>
> I’ve seen both used in the same way. Also the examples site code
> is something we could always gin up a script solution to and isn’t
> a blocker by any means - it’s a smallish portion of the overall
> process and even if it had to be done by hand it’s something we don’t
> do often enough for it to be a real burden. I can speak from experience
> having done most or all of Tika’s releases.
>
> As to the discussions of what’s going on with Git/Github/version
> control, etc., the use of writeable Git repositories at the ASF
> has been sanctioned and used pervasively for years. That Git/Github
> /version control *policy* discussion is pretty independent of using
> the ASF’s own sanctioned writeable git repos on ASF hardware, which
> is all I’m proposing to do. AKA I’m proposing we move Tika’s
> canonical repo from:
>
> http://svn.apache.org/repos/asf/tika/
>
> TO:
>
> https://git-wip-us.apache.org/repos/asf/tika.git
>
> Infra has put policies (temporarily) in place to deal with any of
> the branching issues that have shown up etc. So there is already
> enforcement and so on. And like I said, the ASF has allowed writeable
> Git repos for many years now.
>
> Finally it seems like there is good support so far for this, so
> I’ll keep collecting feedback before calling an official vote maybe
> in the next few days. I’m really hoping there is really no big
> difference other than replacing svn co with git clone and replacing
> svn commit with git commit && git push in most places. One last note:
> many of the “issues” brought up on other projects or being discussed
> at a Foundation policy level are issues e.g., with the Incubator,
> some with newer (ish) TLPs that have arisen over the past few years
> and that are pushing the boundaries on how to use Git in ways that
> are forcing the foundation to ask questions at its core policy
> levels. That discussion is ongoing. Tika has been around since 2007,
> includes a strong set of ASF members, has seen the version control
> debates over the years and long since survived them, etc. I see no
> evidence and an extremely low probability that we will use writeable
> ASF git repos in any such way that drives the policy at the foundation
> level in the same way.
>
> Instead, I see pretty boring use of Git writeable repos to become
> more consistent with the way it seems like more and more of us are
> doing development (even today with Tika).
>
> HTH.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
> -Original Message-
> From: Nick Burch 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, November 18, 2015 at 7:44 AM
> To: "dev@tika.apache.org" 
> Subject: Re: [DISCUSS] Moving to Git
>
> >On Wed, 18 Nov 2015, Mattmann, Chris A (3980) wrote:
> >> I propose we move to writeable git repos for Tika for our repository. I
> >> mostly interact with Git & Github nowadays even with Tika using the
> >> mirroring and PR interaction support.
> >
> >I'm -0 on this at the moment
> >
> >Having followed other Apache lists, it seems that there are quite a few
> >ways to use Git, not all of them compatible with the Apache way, and
> >some of them easy to do wrong.
> >
> >Were we to have some proposed guidelines/information/rules on using Git
> >for Tika, such as about what branches squashing might be permitted on,
> >rules for that, information/rules on remote branches, how to handle /
> >when to use / not-use private branches and GitHub branches, and the
> >like, then I'd be minded to change my vote.
> >
> >I'm also wondering how it would work with the website pulling in bits of
> >the Tika Examples module from SVN for the examples page? That currently
> >uses a svn:externals, so we can keep the code in a normal module + unit
> >test it, then pulls in snippets, how would that work if the code moved to
> >git?
> >
> >Nick
>
>


Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-28 Thread Oleg Tikhonov
Hi Chris,
thanks for doing it.
Yesterday I successfully built Tika using mvn clean install.
All tests passed. Platform: x86_64 Kubuntu with Oracle Java 8. Nothing
special was run.

[x] +1 Release this package as Apache Tika 1.12

Best regards,
Oleg

On Mon, Jan 25, 2016 at 9:58 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A first candidate for the Tika 1.12 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>
>
> The SHA1 checksum of the archive is:
> 30e64645af643959841ac3bb3c41f7e64eba7e5f
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1015/
>
>
> Please vote on releasing this package as Apache Tika 1.12.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.12
> [ ] -1 Do not release this package because…
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
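
Note on checking the SHA1 value quoted in the vote mail above: one way to do it is to compute the digest of the downloaded archive locally and compare it with the posted string. The following is a minimal sketch using only the JDK; the class name Sha1Check and the command-line arguments are illustrative assumptions, not part of any Tika release tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares a file's SHA-1 digest with an expected hex string. */
final class Sha1Check {

    /** Returns the lowercase hex SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** True if the file's digest equals the expected value, ignoring case and surrounding whitespace. */
    static boolean matches(Path archive, String expected) throws IOException {
        // Reads the whole file into memory, which is fine for a source archive of modest size.
        return sha1Hex(Files.readAllBytes(archive)).equalsIgnoreCase(expected.trim());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the downloaded archive, args[1]: checksum copied from the vote mail
        System.out.println(matches(Paths.get(args[0]), args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

A matching checksum only shows that the download is intact; checking the release signature is a separate step.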


Re: Master Build Failing

2016-10-25 Thread Oleg Tikhonov
hi Luis,
Here is what I did:
git clone https://git-wip-us.apache.org/repos/asf/tika.git
git branch
* master

gdalinfo --version
GDAL 1.11.3, released 2015/09/16

mvn clean install -U

Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 42.59 sec -
in org.apache.tika.parser.gdal.TestGDALParser
Running org.apache.tika.parser.executable.ExecutableParserTest


OS: Ubuntu 16, x86_64.






On Mon, Oct 24, 2016 at 8:57 PM, lewis john mcgibbney 
wrote:

> Hi Folks,
> Is master build failing for anyone? I got a brand new laptop and have GDAL
> installed.
> 
> ---
> Test set: org.apache.tika.parser.gdal.TestGDALParser
> 
> ---
> Tests run: 3, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 0.3 sec <<<
> FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
> testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time
> elapsed: 0.124 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at
> org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(
> TestGDALParser.java:79)
>
> testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed:
> 0.101 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at
> org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(
> TestGDALParser.java:165)
>
> testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time
> elapsed: 0.075 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at
> org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(
> TestGDALParser.java:117)
>
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>


Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Oleg Tikhonov
Hi,
+1 for release.
Built on Ubuntu 16.04 and CentOS 7.0 x86_64.

All tests passed. Java 8.

BR,
Oleg

On Thu, Oct 20, 2016 at 5:54 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi Tim
>
> I had exiftool installed indeed, so that might explain it. All tests now
> pass. Will have a closer look at it all later.
>
> Thanks
>
> Julien
>
> On 20 October 2016 at 13:45, Allison, Timothy B. 
> wrote:
>
> > https://issues.apache.org/jira/browse/TIKA-2056
> >
> > Perhaps?
> >
> > -Original Message-
> > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> > Sent: Thursday, October 20, 2016 8:34 AM
> > To: dev@tika.apache.org
> > Subject: Re: [VOTE] Apache Tika 1.14 Release Candidate #1
> >
> > Hi
> >
> > Am getting the following when running 'mvn clean package', have I
> > forgotten something obvious?
> >
> > Julien
> >
> > *Failed tests: *
> > *  ForkParserIntegrationTest.testParserHandlingOfNonSerializable:210
> > expected: but
> > was:*
> > *Tests in error: *
> > *  ForkParserIntegrationTest.testAttachingADebuggerOnTheForkedParserShouldWork:234
> > » Tika*
> > *  ForkParserIntegrationTest.testForkedPDFParsing:257 » Tika Unable to
> > serialize ...*
> > *  ForkParserIntegrationTest.testForkedTextParsing:66 » Tika Unable to
> > serialize ...*
> >
> > *Tests run: 755, Failures: 1, Errors: 3, Skipped: 17*
> >
> > *[INFO] *
> > *[INFO] Reactor Summary:*
> > *[INFO] *
> > *[INFO] Apache Tika parent  SUCCESS
> > [4.368s]*
> > *[INFO] Apache Tika core .. SUCCESS
> > [16.487s]*
> > *[INFO] Apache Tika parsers ... FAILURE
> > [4:54.631s]*
> >
> >
> >
> > On 19 October 2016 at 19:48, Chris Mattmann  wrote:
> >
> > > Hi Folks,
> > >
> > > A first candidate for the Tika 1.14 release is available at:
> > >
> > >   https://dist.apache.org/repos/dist/dev/tika/
> > >
> > > The release candidate is a zip archive of the sources in:
> > >
> > > https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tree;hb=687d7706c9778e4f49f2834a07e5a9d99b23042b
> > >
> > > The SHA1 checksum of the archive is:
> > > ad9152392ffe6b620c8102ab538df0579b36c520
> > >
> > > In addition, a staged maven repository is available here:
> > >
> > > https://repository.apache.org/content/repositories/orgapachetika-1020/
> > >
> > > Please vote on releasing this package as Apache Tika 1.14.
> > > The vote is open for the next 72 hours and passes if a majority of at
> > > least three +1 Tika PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Tika 1.14 [ ] -1 Do not release
> > > this package because..
> > >
> > > Cheers,
> > > Chris
> > >
> > > P.S. Of course here is my +1.
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> >
> > *Open Source Solutions for Text Engineering*
> >
> > http://www.digitalpebble.com
> > http://digitalpebble.blogspot.com/
> > #digitalpebble 
> >
>
>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble 
>


Re: 1.15?

2017-04-18 Thread Oleg Tikhonov
+1 for the release.

On Mon, Apr 17, 2017 at 8:39 PM, David Meikle  wrote:

> +1 from me too.
>
> Cheers,
> Dave
>
> On 13 April 2017 at 13:08, Konstantin Gribov  wrote:
>
> > Preliminary +1 from me, I'll take a closer look this weekend
> >
> > On Thu, 13 Apr 2017, 0:00, Allison, Timothy B. wrote:
> >
> > > All,
> > >   POI is voting on rc1 of the next release.  Once that's released and
> > > integrated into Tika, let's start the release process for Tika 1.15,
> > > end of next week, middle of following?  Any blockers?
> > >
> > >  Cheers,
> > >
> > >  Tim
> > >
> > >
> > > --
> >
> > Best regards,
> > Konstantin Gribov
> >
>


Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-12 Thread Oleg Tikhonov
[x]+1  Release this package as Apache Tika 1.16
Basic tests and build on Ubuntu 17.04 + Java 8 (Oracle).

Thanks,
Oleg

On Wed, Jul 12, 2017 at 11:03 AM, Dave Meikle  wrote:

> On 8 July 2017 at 03:40, Tim Allison  wrote:
>
> >
> > A candidate for the Tika 1.16 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/1.16-rc1
> >
> > The SHA1 checksum of the archive is
> > e6884af0209ace42bf0b9b59d72c3c5a0052055e
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachetika-1025
> >
> > Please vote on releasing this package as Apache Tika 1.16.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.16
> > [ ] -1 Do not release this package because...
> >
> >
> +1 from me. Checksums and signatures good. Built and tested on various
> machines using Java 8. Been run in a production workload and all good.
>
> Cheers,
> Dave
>
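
Note on the voting rule quoted in these threads ("passes if a majority of at least three +1 Tika PMC votes are cast"): read as "at least three binding +1 votes, and more binding +1 than binding -1", the tally can be sketched as below. This reading and the class name ReleaseVote are assumptions made for illustration; only PMC (binding) votes are meant to be counted.

```java
/** Hypothetical tally for a release vote, under the reading stated above. */
final class ReleaseVote {

    /**
     * True if there are at least three binding +1 votes and they
     * outnumber the binding -1 votes.
     */
    static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        // Example: three +1 votes and no -1 votes
        System.out.println(passes(3, 0) ? "vote passes" : "vote does not pass");
    }
}
```

Non-binding votes, like a +1 from a community member who is not on the PMC, are still useful feedback but would not change the result under this reading.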


Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-23 Thread Oleg Tikhonov
Hi guys,
Here is what is wrong ...
<parent>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parent</artifactId>
  <version>1.16-SNAPSHOT</version>
  <relativePath>tika-parent/pom.xml</relativePath>
</parent>


If you clone the project, the top-level pom contains this.
The fix is to change 1.16-SNAPSHOT to 1.15.

What I did was:
git clone https://github.com/apache/tika.git

Any suggestions?

BR,
Oleg




On Tue, May 23, 2017 at 3:01 PM, Allison, Timothy B. 
wrote:

> I _think_ it is included.  See below for the two options for parsing
> testZipEncrypted.zip.
>
> Are you not seeing this behavior?  Were you expecting different behavior?
>
>
> 1) RecursiveParserWrapper
>
> List<Metadata> metadataList = getRecursiveMetadata("testZipEncrypted.zip");
> debug(metadataList);
>
> yields:
>
> 0: X-Parsed-By : org.apache.tika.parser.DefaultParser
> 0: X-Parsed-By : org.apache.tika.parser.pkg.PackageParser
> 0: X-TIKA:EXCEPTION:embedded_stream_exception : 
> org.apache.tika.exception.EncryptedDocumentException:
> stream (encrypted.txt) is encrypted
> at org.apache.tika.parser.pkg.PackageParser.parseEntry(
> PackageParser.java:306)
> at org.apache.tika.parser.pkg.PackageParser.parse(
> PackageParser.java:230)
> at org.apache.tika.parser.CompositeParser.parse(
> CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(
> CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(
> AutoDetectParser.java:135)
> at org.apache.tika.parser.RecursiveParserWrapper.parse(
> RecursiveParserWrapper.java:158)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.
> java:221)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.
> java:213)
> at org.apache.tika.parser.pkg.ZipParserTest.testZipEncrypted(
> ZipParserTest.java:213)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(
> FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(
> ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(
> FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.
> evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.RunBefores.
> evaluate(RunBefores.java:26)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(
> BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(
> BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(
> ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(
> ParentRunner.java:268)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(
> JUnit4IdeaTestRunner.java:68)
> at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.
> startRunnerWithArgs(IdeaTestRunner.java:51)
> at com.intellij.rt.execution.junit.JUnitStarter.
> prepareStreamsAndStart(JUnitStarter.java:242)
> at com.intellij.rt.execution.junit.JUnitStarter.main(
> JUnitStarter.java:70)
>
> 0: X-TIKA:parse_time_millis : 34
> 0: X-TIKA:content : http://www.w3.org/1999/xhtml;>
> 
> 
>  />
> 
> 
> 
> 
> unencrypted.txt
> 
> encrypted.txt
> 
> 0: Content-Type : application/zip
> 1: date : 2017-03-21T13:07:48Z
> 1: X-Parsed-By : org.apache.tika.parser.DefaultParser
> 1: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
> 1: resourceName : unencrypted.txt
> 1: dcterms:modified : 2017-03-21T13:07:48Z
> 1: Last-Modified : 2017-03-21T13:07:48Z
> 1: Last-Save-Date : 2017-03-21T13:07:48Z
> 1: embeddedRelationshipId : unencrypted.txt
> 1: meta:save-date : 2017-03-21T13:07:48Z
> 1: Content-Encoding : windows-1252
> 1: X-TIKA:parse_time_millis : 3
> 1: modified : 2017-03-21T13:07:48Z
> 1: X-TIKA:content : http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> hello world
> 
> 
> 1: Content-Length : 13
> 1: X-TIKA:embedded_resource_path : /unencrypted.txt
> 1: Content-Type : text/plain; charset=windows-1252
>
> 2) Classic XML:
>
> XMLResult r = getXML("testZipEncrypted.zip");
> for (String n : r.metadata.names()) {
> for (String v : r.metadata.getValues(n)) {
> 

Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-23 Thread Oleg Tikhonov
I also marked
./tika-dl/src/test/java/org/apache/tika/dl/imagerec/DL4JInceptionV3NetTest.java
with @Ignore because I do not have any DL installed on my computer.


On Tue, May 23, 2017 at 11:00 PM, Oleg Tikhonov <o...@apache.org> wrote:

> Hi guys,
> Here is what is wrong ...
> <parent>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parent</artifactId>
>   <version>1.16-SNAPSHOT</version>
>   <relativePath>tika-parent/pom.xml</relativePath>
> </parent>
>
>
> If you clone the project, the top-level pom contains this.
> The fix is to change 1.16-SNAPSHOT to 1.15.
>
> What I did was:
> git clone https://github.com/apache/tika.git
>
> Any suggestions?
>
> BR,
> Oleg
>
>
>
>
> On Tue, May 23, 2017 at 3:01 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
>> I _think_ it is included.  See below for the two options for parsing
>> testZipEncrypted.zip.
>>
>> Are you not seeing this behavior?  Were you expecting different behavior?
>>
>>
>> 1) RecursiveParserWrapper
>>
>> List<Metadata> metadataList = getRecursiveMetadata("testZipEncrypted.zip");
>> debug(metadataList);
>>
>> yields:
>>
>> 0: X-Parsed-By : org.apache.tika.parser.DefaultParser
>> 0: X-Parsed-By : org.apache.tika.parser.pkg.PackageParser
>> 0: X-TIKA:EXCEPTION:embedded_stream_exception :
>> org.apache.tika.exception.EncryptedDocumentException: stream
>> (encrypted.txt) is encrypted
>> at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageP
>> arser.java:306)
>> at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser
>> .java:230)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser
>> .java:280)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser
>> .java:280)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectPars
>> er.java:135)
>> at org.apache.tika.parser.RecursiveParserWrapper.parse(Recursiv
>> eParserWrapper.java:158)
>> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:
>> 221)
>> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:
>> 213)
>> at org.apache.tika.parser.pkg.ZipParserTest.testZipEncrypted(Zi
>> pParserTest.java:213)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(
>> FrameworkMethod.java:50)
>> at org.junit.internal.runners.model.ReflectiveCallable.run(Refl
>> ectiveCallable.java:12)
>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(Fr
>> ameworkMethod.java:47)
>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(
>> InvokeMethod.java:17)
>> at org.junit.internal.runners.statements.RunBefores.evaluate(
>> RunBefores.java:26)
>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit
>> 4ClassRunner.java:78)
>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit
>> 4ClassRunner.java:57)
>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:
>> 71)
>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.
>> java:288)
>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:
>> 58)
>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:
>> 268)
>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>> at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs
>> (JUnit4IdeaTestRunner.java:68)
>> at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.star
>> tRunnerWithArgs(IdeaTestRunner.java:51)
>> at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsA
>> ndStart(JUnitStarter.java:242)
>> at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStart
>> er.java:70)
>>
>> 0: X-TIKA:parse_time_millis : 34
>> 0: X-TIKA:content : http://www.w3.org/1999/xhtml;>
>> 
>> > />
>> > />
>> 
>> 
>> 
>> 
>> une

Re: [VOTE] Release Apache Tika 1.15 Candidate #2

2017-05-24 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.15

[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 19:41 min
[INFO] Finished at: 2017-05-24T22:22:17+03:00
[INFO] Final Memory: 116M/983M
[INFO]


Tested on Ubuntu 1.16 x86_64

Thanks !!!



On Wed, May 24, 2017 at 4:22 AM, Tim Allison  wrote:

> A second candidate for the Tika 1.15 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/1.15-rc2/
>
> The SHA1 checksum of the archive is
> e283468e47855f9142578c126e12f02eb5b08d2b.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1023/org/apache/tika/
>
> Please vote on releasing this package as Apache Tika 1.15.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.15
> [ ] -1 Do not release this package because...
>
>
> -Tim
>
> P.S. This is my +1.
>


Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-24 Thread Oleg Tikhonov
Cannot reproduce after having done some workarounds ...



On Wed, May 24, 2017 at 3:05 AM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Hi Oleg,
>   What's your error on that unit test?
>
> -Original Message-
> From: olegtikho...@gmail.com [mailto:olegtikho...@gmail.com] On Behalf Of
> Oleg Tikhonov
> Sent: Tuesday, May 23, 2017 4:33 PM
> To: dev@tika.apache.org
> Subject: Re: [VOTE] Release Apache Tika 1.15 Candidate #1
>
> I also marked
> ./tika-dl/src/test/java/org/apache/tika/dl/imagerec/DL4JInceptionV3NetTest.java
> with @Ignore because I do not have any DL installed on my computer.
>
>
> On Tue, May 23, 2017 at 11:00 PM, Oleg Tikhonov <o...@apache.org> wrote:
>
> > Hi guys,
> > Here is what is wrong ...
> > <parent>
> >   <groupId>org.apache.tika</groupId>
> >   <artifactId>tika-parent</artifactId>
> >   <version>1.16-SNAPSHOT</version>
> >   <relativePath>tika-parent/pom.xml</relativePath>
> > </parent>
> >
> >
> > If you clone the project, the top-level pom contains this.
> > The fix is to change 1.16-SNAPSHOT to 1.15.
> >
> > What I did was:
> > git clone https://github.com/apache/tika.git
> >
> > Any suggestions?
> >
> > BR,
> > Oleg
> >
> >
> >
> >
> > On Tue, May 23, 2017 at 3:01 PM, Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> >
> >> I _think_ it is included.  See below for the two options for parsing
> >> testZipEncrypted.zip.
> >>
> >> Are you not seeing this behavior?  Were you expecting different
> >> behavior?
> >>
> >>
> >> 1) RecursiveParserWrapper
> >>
> >> List<Metadata> metadataList = getRecursiveMetadata("testZipEncrypted.zip");
> >> debug(metadataList);
> >>
> >> yields:
> >>
> >> 0: X-Parsed-By : org.apache.tika.parser.DefaultParser
> >> 0: X-Parsed-By : org.apache.tika.parser.pkg.PackageParser
> >> 0: X-TIKA:EXCEPTION:embedded_stream_exception :
> >> org.apache.tika.exception.EncryptedDocumentException: stream
> >> (encrypted.txt) is encrypted
> >> at
> >> org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageP
> >> arser.java:306)
> >> at
> >> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser
> >> .java:230)
> >> at
> >> org.apache.tika.parser.CompositeParser.parse(CompositeParser
> >> .java:280)
> >> at
> >> org.apache.tika.parser.CompositeParser.parse(CompositeParser
> >> .java:280)
> >> at
> >> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectPars
> >> er.java:135)
> >> at
> >> org.apache.tika.parser.RecursiveParserWrapper.parse(Recursiv
> >> eParserWrapper.java:158)
> >> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:
> >> 221)
> >> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:
> >> 213)
> >> at
> >> org.apache.tika.parser.pkg.ZipParserTest.testZipEncrypted(Zi
> >> pParserTest.java:213)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> >> ssorImpl.java:62)
> >> at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> >> thodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:498)
> >> at
> >> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(
> >> FrameworkMethod.java:50)
> >> at
> >> org.junit.internal.runners.model.ReflectiveCallable.run(Refl
> >> ectiveCallable.java:12)
> >> at
> >> org.junit.runners.model.FrameworkMethod.invokeExplosively(Fr
> >> ameworkMethod.java:47)
> >> at
> >> org.junit.internal.runners.statements.InvokeMethod.evaluate(
> >> InvokeMethod.java:17)
> >> at org.junit.internal.runners.statements.RunBefores.evaluate(
> >> RunBefores.java:26)
> >> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> >> at
> >> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit
> >> 4ClassRunner.java:78)
> >> at
> >> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit
> >> 4ClassRunner.java:57)
> >> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> >> at org.junit.runners.ParentRunner$1.schedule(ParentRunn

Re: experiences with Tika in Docker

2017-06-02 Thread Oleg Tikhonov
Guys, I can help with Tika dockerization. Let's just design/plan what we are
going to do.

On Thu, Jun 1, 2017 at 4:02 PM, Eric Pugh 
wrote:

> As the Tika project starts embracing more non Java tools (I’m thinking of
> Tesseract for example), dockerizing your Tika setup becomes more and more
> valuable.
>
> For example, I run my tests for my application on my local Mac, as well as
> on CircleCI.   I have a dockerized Tika service that does the OCR stuff,
> and I know it’s the same work on both.   It’s less exciting if I’m in an
> “all Java” world.
>
>
> > On Jun 1, 2017, at 7:55 AM, Allison, Timothy B. 
> wrote:
> >
> > Thank you, Thejan!
> >
> > -Original Message-
> > From: Thejan Wijesinghe [mailto:thejan.k.wijesin...@gmail.com]
> > Sent: Wednesday, May 31, 2017 5:40 PM
> > To: dev@tika.apache.org
> > Subject: Re: experiences with Tika in Docker
> >
> > Hi Tim,
> >
> > I've used tika-server in Docker, but as a single instance only. Yes, its
> > ability to limit a container's memory and CPU use on the host machine is
> > great. It gives us a lot of flexibility: we can enforce hard/soft memory
> > limits and even limit the host CPU cycles a container gets. Yes, it also
> > limits the risk from arbitrary code execution and XXE vulnerabilities. I
> > already asked Prof. Chris Mattmann about officially moving to Docker Hub.
> > He said I need to send a mail to Apache infra asking about this.
> > Unfortunately, I still haven't found the time to write that mail.
> >
> > We already have multiple Dockerfiles in Tika: the Dockerfile in
> > tika-server, InceptionRestDockerfile, InceptionVideoRestDockerfile, and
> > Im2txtRestDockerfile (PR #180, for image captioning).
> >
> > Part of my GSoC project is to unify the existing REST services, such as
> > object recognition and image captioning, so that the user can start or
> > terminate any REST service and see its statistics through a web-based GUI.
> > I expect to use a combination of nginx (as the reverse proxy server) and
> > Docker to make it work. So obviously we will see Docker much more often in
> > Tika.
> >
> > +1 for your thought of looking into hardening tika-server with the help
> > of Docker.
> >
> > best,
> > ThejanW
> >
> > On Thu, Jun 1, 2017 at 1:03 AM, Allison, Timothy B. 
> > wrote:
> >
> >> Dave Meikle, Tom and All,
> >>
> >>How many of us are using Tika in Docker?  If so, how exactly are
> >> you using it?  Single instance, swarm, Kubernetes, something else?
> >> People fear I/O hit with tika-server...what are your experiences?
> >> I really like the ability to limit the number of CPUs in the Docker
> >> container.  If a single doc causes multithreaded gc to go nuts, that
> >> won't kill an entire machine.  This also cleanly limits the risk from
> >> XXE or arbitrary code execution, right?
> >>
> >> If this is one of the ways of the future for big data, we might want
> >> to look into hardening tika-server (OOMs, timeouts).  What do you all
> >> think?
> >>
> >>Cheers,
> >>
> >>Tim
> >>
> >> Timothy B. Allison, Ph.D.
> >> Principal Artificial Intelligence Engineer Group Lead K83E/Human
> >> Language Technology The MITRE Corporation
> >> 7515 Colshire Drive, McLean, VA  22102
> >> 703-983-2473 (phone); 703-983-1379 (fax)
> >>
> >>
>
>
> ___
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com | My Free/Busy
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
>
>


Re: [jira] [Created] (TIKA-2647) Create a "security" page on our website

2018-05-22 Thread Oleg Tikhonov
Hi Tim,
definitely would be helpful !
+1
Thanks,
Oleg

On Tue, May 22, 2018 at 3:38 PM, Tim Allison (JIRA)  wrote:

> Tim Allison created TIKA-2647:
> -
>
>  Summary: Create a "security" page on our website
>  Key: TIKA-2647
>  URL: https://issues.apache.org/jira/browse/TIKA-2647
>  Project: Tika
>   Issue Type: New Feature
> Reporter: Tim Allison
>
>
> I think it would be helpful for us to document any CVEs we've had on one
> central page on our website.  WDYT?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>


Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
With this approach, that is probably the only way ...
What is the typical tika-server environment? Stand-alone, distributed ... like
replicas in a cluster?
Is there a time limit for recovery? How do we know which point to resume
processing from?
Do we mark documents that have been processed?
For example, if tika-server ran on Docker swarm/K8S, the orchestrator
would restart a failed replica itself ...


On Thu, Sep 6, 2018 at 4:58 PM Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605816#comment-16605816
> ]
>
> Tim Allison commented on TIKA-2725:
> ---
>
> From [~o...@apache.org] on the dev list:
>
> bq. What if watcher thread fails/gets stuck etc?
>
> To confirm, that's the watcher thread in the child process.  Y, that's why
> I think we should also have a ping from the parent process.  WDYT?
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > 
> >
> > Key: TIKA-2725
> > URL: https://issues.apache.org/jira/browse/TIKA-2725
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Assignee: Tim Allison
> >Priority: Major
> >
> > Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
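
Note on option 2 in the issue description quoted above (a watcher thread in the child that ends the child "on oom/timeout/after x files"): the timeout and file-count part of that decision can be sketched as below. This is only an illustration under stated assumptions, not the tika-server implementation. The class ChildWatchdog, its method names and the onExit callback are hypothetical; OOM detection and the parent-side restart and ping are not shown.

```java
/** Hypothetical sketch of the restart decision a watcher thread in the child might make. */
final class ChildWatchdog {
    private final long taskTimeoutMillis;
    private final int maxFiles;
    private long taskStartMillis = -1; // -1 means no parse is currently in flight
    private int filesProcessed;

    ChildWatchdog(long taskTimeoutMillis, int maxFiles) {
        this.taskTimeoutMillis = taskTimeoutMillis;
        this.maxFiles = maxFiles;
    }

    /** Called by the request handler just before a parse starts. */
    synchronized void taskStarted(long nowMillis) {
        taskStartMillis = nowMillis;
    }

    /** Called by the request handler when a parse ends, successfully or not. */
    synchronized void taskFinished() {
        taskStartMillis = -1;
        filesProcessed++;
    }

    /** True if the current parse has run too long or the file budget is used up. */
    synchronized boolean shouldExit(long nowMillis) {
        boolean timedOut = taskStartMillis >= 0
                && nowMillis - taskStartMillis > taskTimeoutMillis;
        return timedOut || filesProcessed >= maxFiles;
    }

    /** Starts a daemon thread that polls shouldExit and runs onExit once it returns true. */
    Thread startWatcher(long pollMillis, Runnable onExit) {
        Thread watcher = new Thread(() -> {
            try {
                while (!shouldExit(System.currentTimeMillis())) {
                    Thread.sleep(pollMillis);
                }
                onExit.run(); // in a real child this might be () -> System.exit(1)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "child-watchdog");
        watcher.setDaemon(true);
        watcher.start();
        return watcher;
    }
}
```

Keeping the decision in shouldExit, separate from the polling thread, makes it easy to test with explicit timestamps. It does not answer the question of what happens if the watcher thread itself gets stuck, which is why a ping from the parent is also discussed in this thread.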


Re: [jira] [Created] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
Hi Tim,
What if watcher thread fails/gets stuck etc?



On Thu, Sep 6, 2018 at 3:27 PM Tim Allison (JIRA)  wrote:

> Tim Allison created TIKA-2725:
> -
>
>  Summary: Make tika-server robust against ooms/infinite
> loops/memory leaks
>  Key: TIKA-2725
>  URL: https://issues.apache.org/jira/browse/TIKA-2725
>  Project: Tika
>   Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
>
>
> Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
>
> 1) use the ForkParser
> 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
>
> I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
>
> Other options/recommendations?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>


Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
Ideally, tika-server is dockerized and runs on swarm as a service. In
addition, it has a healthcheck mechanism, say, an HTTP GET request that
expects return code 200. Docker will run this healthcheck periodically and,
if it fails, will restart tika-server.
However, we are far from that. From my point of view there are two ways to
go: 1. your second option, or 2. an OS daemon that checks tika-server
availability or something like that. We can use cron on Linux to run our
"healthcheck" and, if it detects anomalies, restart the server. We can
probably find a similar mechanism for Windows as well.
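
The healthcheck described above (an HTTP GET that must return 200) could be sketched roughly as follows. This is an illustration only: the class name TikaHealthCheck is hypothetical, and the default URL (port 9998, path /tika) is an assumption about a local tika-server that should be adjusted to the actual deployment.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical healthcheck probe: healthy means "GET returned HTTP 200 in time". */
final class TikaHealthCheck {

    static boolean isHealthy(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn.getResponseCode() == 200;
        } catch (IOException e) {
            // Malformed URL, connection refused, timeout: all count as unhealthy.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Assumed default endpoint; pass the real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:9998/tika";
        // Exit code 0 = healthy, 1 = unhealthy, so cron or a container
        // healthcheck can act on the result.
        System.exit(isHealthy(url, 5000) ? 0 : 1);
    }
}
```

A cron job or container healthcheck could run this probe and restart the server on a non-zero exit code; the restart command itself is deployment-specific and is left out here.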


On Thu, Sep 6, 2018, 18:29 Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605925#comment-16605925
> ]
>
> Tim Allison commented on TIKA-2725:
> ---
>
> bq. What is tika-server typical env? stand-alone, distributed ... like
> replicas in cluster?
>
> It varies, I'm sure.  Not sure what most common use case is.  I would hope
> distributed -- swarm or similar.
>
> bq. Are there some time limitation for recovery?
>
> I think whoever starts the server should be able to set the threshold for
> timeouts per file...although I may misunderstand your question.
>
> bq.  How do we know what point to start processing from?
> That wouldn't be tika-server's problem.  Clients calling tika-server would
> get an error message, or potentially no response within a socket/http
> timeout range.  They should not reprocess those docs.
>
> bq. Do we mark documents which were processed?
> Same as above, that's a client concern.
>
> bq. For example, if tika-server had run on Docker swarm/K8S then
> orchestrator would have restarted a failed replica itself
> To confirm that I understand this correctly, currently, if the tika-server
> process dies, swarm/k8s will automatically restart it?  That's good to
> hear.  However, we don't currently have the watcher thread within
> tika-server to kill its own process on oom/timeout...so as it is now, it
> would have to be something catastrophic taking down tika-server (operating
> system, perhaps?).
>
>
>
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > 
> >
> > Key: TIKA-2725
> > URL: https://issues.apache.org/jira/browse/TIKA-2725
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Assignee: Tim Allison
> >Priority: Major
> >
> > Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>


Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-07 Thread Oleg Tikhonov
Yep, that seems to be the best match... non-blocking execution.


On Thu, Sep 6, 2018, 23:47 Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606373#comment-16606373
> ]
>
> Tim Allison commented on TIKA-2725:
> ---
>
> {quote}
> Ideally, tika-server is dockerized and runs on swarm as a service. In
> addition, it has a healthcheck mechanism, say, an HTTP GET request that
> expects return code 200. Docker will run this healthcheck periodically and,
> if it fails, will restart tika-server.
> However, we are far from that. From my point of view there are two ways to
> go: 1. your second option, or 2. an OS daemon that checks tika-server
> availability or something like that. We can use cron on Linux to run our
> "healthcheck" and, if it detects anomalies, restart the server. We can
> probably find a similar mechanism for Windows as well.
> {quote}
>
> CommonsExec?
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > 
> >
> > Key: TIKA-2725
> > URL: https://issues.apache.org/jira/browse/TIKA-2725
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Assignee: Tim Allison
> >Priority: Major
> >
> > Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
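
The "CommonsExec?" suggestion refers to Apache Commons Exec, which provides a watchdog for killing runaway child processes. As a dependency-free stand-in for the same idea, here is a rough sketch using only the JDK's ProcessBuilder. The class name ChildRunner and the method runWithTimeout are hypothetical, and this is not how tika-server implements it; it only shows the parent running a child with a time limit and killing it when the limit is exceeded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical parent-side runner: start a child, kill it if it runs too long. */
final class ChildRunner {

    /**
     * Returns the child's exit code, or an empty OptionalInt if the child
     * exceeded the timeout and was forcibly destroyed.
     */
    static OptionalInt runWithTimeout(List<String> command, long timeoutMillis) {
        try {
            Process child = new ProcessBuilder(command).inheritIO().start();
            if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return OptionalInt.of(child.exitValue());
            }
            child.destroyForcibly();
            child.waitFor(); // wait until the kill has taken effect
            return OptionalInt.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Demo child: the current JVM's launcher printing its version.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        OptionalInt result = runWithTimeout(Arrays.asList(javaBin, "-version"), 60_000L);
        System.out.println(result.isPresent()
                ? "child exited with code " + result.getAsInt()
                : "child timed out and was killed");
    }
}
```

A supervising parent could call runWithTimeout in a loop and start a fresh child whenever the previous one exits or is killed, which corresponds to option 2 in the issue description.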


Re: [VOTE] Release Apache Tika 1.18 Candidate #1

2018-04-11 Thread Oleg Tikhonov
[+] Release this package as Apache Tika 1.18

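[INFO] Apache Tika parent ................................. SUCCESS [ 12.379 s]
[INFO] Apache Tika core ................................... SUCCESS [ 55.650 s]
[INFO] Apache Tika parsers ................................ SUCCESS [05:55 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  7.254 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  3.857 s]
[INFO] Apache Tika batch .................................. SUCCESS [02:13 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  8.152 s]
[INFO] Apache Tika application ............................ SUCCESS [01:13 min]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 57.625 s]
[INFO] Apache Tika translate .............................. SUCCESS [  8.393 s]
[INFO] Apache Tika server ................................. SUCCESS [01:05 min]
[INFO] Apache Tika examples ............................... SUCCESS [ 19.053 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  5.646 s]
[INFO] Apache Tika eval ................................... SUCCESS [ 44.564 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [07:45 min]
[INFO] Apache Tika Natural Language Processing ............ SUCCESS [01:47 min]
[INFO] Apache Tika ........................................ SUCCESS [  0.145 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS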

Built on CentOS 7.3. I only did basic checks.

I've seen that we have a Docker image build script. Is there any
documentation for it?
I will dig into it...
Thanks a lot,
Oleg

On Tue, Apr 10, 2018 at 3:36 PM, Tim Allison  wrote:

> A candidate for the Tika 1.18 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.18-rc1/
>
> The SHA-512 checksum of the archive is
>   7f2e76e2973c9a0c3ba572afa74686ff95f0628136940b592c61d3639fe8
> 123f977fe321693a6c02a650172f3ef442e7a3adfa93d81d1d770233e47d8911b79e.
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachetika-1031
>
>
>
> Please vote on releasing this package as Apache Tika 1.18.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.18
> [ ] -1 Do not release this package because...
>
> Here's my +1
>
> On behalf of the Apache Tika team,
>
>  Tim
>


Re: [VOTE] Release Apache Tika 1.18 Candidate #3

2018-04-22 Thread Oleg Tikhonov
Hi,
thanks a lot.
[x] +1 Release this package as Apache Tika 1.18

I also ran a security scan:
mvn org.owasp:dependency-check-maven:3.1.2:check

The report is attached.

Best regards,
Oleg


On Sat, Apr 21, 2018 at 12:54 AM, talli...@apache.org 
wrote:

> All,
> A candidate for the Tika 1.18 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/1.18-rc3
> The SHA-512 checksum of the archive is f69ee27b31cf7bcb1eaf114b93c23d
> d85b974356cc7e6e265b1c9366a11d711a3341e690f5b452a3e8b0c5cc6f
> 5839db01b3ef6ec3a2a29ffcd332ff7a63dcf3.
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1033
> Please vote on releasing this package as Apache Tika 1.18. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika PMC
> votes are cast.
> [ ] +1 Release this package as Apache Tika 1.18
> [ ] -1 Do not release this package because...
> +1 from me; third time's the charm...
> Cheers,
> Tim

