Re: OCR with tika-server

2014-10-06 Thread kevin slote
Ok, I am signed up.

https://wiki.apache.org/tika/Kevin%20Slote

On Fri, Oct 3, 2014 at 11:02 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Kevin glad it is now fixed with you!
>
> If you get a chance, please feel free to document
> this on the wiki:
>
> https://wiki.apache.org/tika/TikaOCR
>
>
> You can sign up for an account, and then I can grant
> you permissions to edit the file. Let me know!
>
> Cheers,
> Chris
>
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: kevin slote 
> Reply-To: "dev@tika.apache.org" 
> Date: Friday, October 3, 2014 at 4:10 PM
> To: "dev@tika.apache.org" 
> Subject: Re: OCR with tika-server
>
> >Hi all,
> >
> >I just confirmed that the problem was that my version of tesseract was too
> >old.
> >Maybe it would be a good idea to put something in the canRun method at the
> >top of the tesseract unit test to also check that the version of tesseract
> >is recent enough?
> >
> >Older versions of tesseract do not have a "-v" or "--version" flag.  So
> >maybe use ProcessBuilder to run that command and parse the string to see
> >if
> >it returned an error?
> >
> >Thanks for everyone's help.
> >
> >On Fri, Oct 3, 2014 at 2:30 PM, kevin slote  wrote:
> >
> >> Thanks for following up!
> >>
> >> I was trying to dig deeper before I responded.
> >>
> >> Tyler,
> >>
> >> I followed those instructions.  My version of Tesseract does not ocr the
> >> google logo because it is not a tiff.  I used imagemagick to convert it
> >>to
> >> a tif and tesseract returned "check_legal_image_size:Error:Only
> >>1,2,4,5,6,8
> >> bpp are supported:32" error which usually means it needs to be re-sized
> >> with imagemagick.
> >>
> >>
> >> Chris,
> >>
> >> I wrote a python wrapper for tesseract that can parse the documents that
> >> were in your test-document repository concerning OCR (testOCR.pdf,
> >>etc.) It
> >> looks like right now, in TesseractOCRParser.java, the command line
> >>argument
> >> that is passed to the os points to a .tmp file in /tmp/.
> >>
> >> So the command that is executed is
> >>
> >>"tesseract /tmp/apache-tika-2409864150710514587.tmp
> >> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
> >>
> >> This is not working for me.  When I grab those .tmp files and try to ocr
> >> them from the command line, tesseract gets thrown for a loop.
> >>
> >> From what I can tell, the tesseract I have installed can only handle
> >> .tif files.
> >> I can back this up by citing the tesseract page:
> >> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
> >>
> >>  If Tesseract isn't available for your distribution, or you want to use
> >>a
> >> newer version than they offer, you can compile your own
> >> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that
> >>older
> >> versions of Tesseract only supported processing .tiff files.
> >>
> >> So, I think that upgrading tesseract or moving to ubuntu 12 or higher
> >>will
> >> solve my problems.
> >>
> >> I will let the listserv know if that fixes it.
> >>
> >>
> >> Kevin Slote
> >>
> >>
> >>
> >> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
> >> chris.a.mattm...@jpl.nasa.gov> wrote:
> >>
> >>> What type of image is it, Kevin?
> >>>
> >>> If it’s a TIFF, you need to install tesseract with special lib tiff
> >>> parameters. See:
> >>>
> >>> https://gist.github.com/henrik/1967035
> >>>
> >>>
> >>> Can you parse the image file with tesseract by itself, without

Re: OCR with tika-server

2014-10-03 Thread kevin slote
Hi all,

I just confirmed that the problem was that my version of tesseract was too
old.
Maybe it would be a good idea to put something in the canRun method at the
top of the tesseract unit test to also check that the version of tesseract
is recent enough?

Older versions of tesseract do not have a "-v" or "--version" flag.  So
maybe use ProcessBuilder to run that command and parse the string to see if
it returned an error?
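
Something along these lines, say -- just a sketch of the idea, not actual
Tika code; the helper name and the "non-zero exit means too old" convention
are assumptions:

    static boolean hasRecentTesseract(String tesseractDir) {
        try {
            // e.g. "/usr/bin/" + "tesseract", mirroring how the configured path is used
            Process p = new ProcessBuilder(tesseractDir + "tesseract", "--version")
                    .redirectErrorStream(true)
                    .start();
            // drain the (small) output so the process can finish
            while (p.getInputStream().read() != -1) { }
            // old builds reject the flag; newer ones print a version and exit 0
            return p.waitFor() == 0;
        } catch (Exception e) {
            return false;   // tesseract missing or not runnable at all
        }
    }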

Thanks for everyone's help.

On Fri, Oct 3, 2014 at 2:30 PM, kevin slote  wrote:

> Thanks for following up!
>
> I was trying to dig deeper before I responded.
>
> Tyler,
>
> I followed those instructions.  My version of Tesseract does not ocr the
> google logo because it is not a tiff.  I used imagemagick to convert it to
> a tif and tesseract returned "check_legal_image_size:Error:Only 1,2,4,5,6,8
> bpp are supported:32" error which usually means it needs to be re-sized
> with imagemagick.
>
>
> Chris,
>
> I wrote a python wrapper for tesseract that can parse the documents that
> were in your test-document repository concerning OCR (testOCR.pdf, etc.) It
> looks like right now, in TesseractOCRParser.java, the command line argument
> that is passed to the os points to a .tmp file in /tmp/.
>
> So the command that is executed is
>
>"tesseract /tmp/apache-tika-2409864150710514587.tmp
> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>
> This is not working for me.  When I grab those .tmp files and try to ocr
> them from the command line, tesseract gets thrown for a loop.
>
> From what I can tell, the tesseract I have installed can only handle
> .tif files.
> I can back this up by citing the tesseract page:
> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>
>  If Tesseract isn't available for your distribution, or you want to use a
> newer version than they offer, you can compile your own
> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that  older
> versions of Tesseract only supported processing .tiff files.
>
> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
> solve my problems.
>
> I will let the listserv know if that fixes it.
>
>
> Kevin Slote
>
>
>
> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> What type of image is it, Kevin?
>>
>> If it’s a TIFF, you need to install tesseract with special lib tiff
>> parameters. See:
>>
>> https://gist.github.com/henrik/1967035
>>
>>
>> Can you parse the image file with tesseract by itself, without
>> Tika’s tmp image?
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: , "Paul M   (398J)" 
>> Reply-To: "dev@tika.apache.org" 
>> Date: Wednesday, October 1, 2014 at 1:47 PM
>> To: "" 
>> Subject: Re: OCR with tika-server
>>
>> >Nothing to be embarrassed about at all Kevin. I actually thought maybe it
>> >was just a typo issue and I randomly happened to catch that. I've
>> >definitely done that one before myself.
>> >
>> >Bummed that was not the problem.
>> >
>> >--Paul
>> >
>> >On Oct 1, 2014, at 1:00 PM, kevin slote 
>> > wrote:
>> >
>> >> What I wrote there did have a typo in it. (It's not every day you get
>> to
>> >> embarrass yourself in front of a bunch of guys from NASA)
>> >>
>> >> But that was not what I had in my terminal when I tested it.
>> >>
>> >>
>> >>
>> >> The actual PATH was:
>> >>
>> >>
>> >>
>> >>
>> >>
>>
>> >>"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/g
>> >>ames:/usr/bin/tesseract"
>> >>
>> >>
>> >>
>> >> I think what was actually wrong with the path is that I added

Re: OCR with tika-server

2014-10-03 Thread kevin slote
Thanks for following up!

I was trying to dig deeper before I responded.

Tyler,

I followed those instructions.  My version of Tesseract does not ocr the
google logo because it is not a tiff.  I used imagemagick to convert it to
a tif and tesseract returned "check_legal_image_size:Error:Only 1,2,4,5,6,8
bpp are supported:32" error which usually means it needs to be re-sized
with imagemagick.


Chris,

I wrote a python wrapper for tesseract that can parse the documents that
were in your test-document repository concerning OCR (testOCR.pdf, etc.) It
looks like right now, in TesseractOCRParser.java, the command line argument
that is passed to the os points to a .tmp file in /tmp/.

So the command that is executed is

   "tesseract /tmp/apache-tika-2409864150710514587.tmp
/tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"

This is not working for me.  When I grab those .tmp files and try to ocr
them from the command line, tesseract gets thrown for a loop.

From what I can tell, the tesseract I have installed can only handle
.tif files.
I can back this up by citing the tesseract page:
https://code.google.com/p/tesseract-ocr/wiki/ReadMe

 If Tesseract isn't available for your distribution, or you want to use a
newer version than they offer, you can compile your own
<https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that  older
versions of Tesseract only supported processing .tiff files.

So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
solve my problems.

I will let the listserv know if that fixes it.


Kevin Slote



On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> What type of image is it, Kevin?
>
> If it’s a TIFF, you need to install tesseract with special lib tiff
> parameters. See:
>
> https://gist.github.com/henrik/1967035
>
>
> Can you parse the image file with tesseract by itself, without
> Tika’s tmp image?
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: , "Paul M   (398J)" 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, October 1, 2014 at 1:47 PM
> To: "" 
> Subject: Re: OCR with tika-server
>
> >Nothing to be embarrassed about at all Kevin. I actually thought maybe it
> >was just a typo issue and I randomly happened to catch that. I've
> >definitely done that one before myself.
> >
> >Bummed that was not the problem.
> >
> >--Paul
> >
> >On Oct 1, 2014, at 1:00 PM, kevin slote 
> > wrote:
> >
> >> What I wrote there did have a typo in it. (It's not every day you get to
> >> embarrass yourself in front of a bunch of guys from NASA)
> >>
> >> But that was not what I had in my terminal when I tested it.
> >>
> >>
> >>
> >> The actual PATH was:
> >>
> >>
> >>
> >>
> >>
> >>"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/g
> >>ames:/usr/bin/tesseract"
> >>
> >>
> >>
> >> I think what was actually wrong with the path is that I added the entire
> >> path to the tesseract executable, which was in my /usr/bin/ directory,
> >> instead of just the directory where tesseract lives.  Is this true?
> >>
> >>
> >>
> >> I deleted the hard coding from TesseractOCRConfig.java and then
> >>printed
> >> config.getTesseractPath() to stdout.  This field was empty.
> >>
> >> However, I have tesseract installed system wide on this ubuntu vm.
> >>
> >> So the canRun method evaluated as true whether or not the tesseractPath
> >>was
> >> configured correctly.
> >>
> >>
> >>
> >> I have been slowly trying to debug this all day.  It looks like tika is
> >> making a tmp file with a .tmp extension.
> >>
> >> I commented out some of the code so that the temp files remained in /tmp/.
> >>
> >>
> >>
> >> It looks like tesseract doesn't like that.

Re: OCR with tika-server

2014-10-01 Thread kevin slote
What I wrote there did have a typo in it. (It's not every day you get to
embarrass yourself in front of a bunch of guys from NASA)

But that was not what I had in my terminal when I tested it.



The actual PATH was:




"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"



I think what was actually wrong with the path is that I added the entire
path to the tesseract executable, which was in my /usr/bin/ directory,
instead of just the directory where tesseract lives.  Is this true?



I deleted the hard coding from TesseractOCRConfig.java and then printed
config.getTesseractPath() to stdout.  This field was empty.

However, I have tesseract installed system wide on this ubuntu vm.

So the canRun method evaluated as true whether or not the tesseractPath was
configured correctly.



I have been slowly trying to debug this all day.  It looks like tika is
making a tmp file with a .tmp extension.

I commented out some of the code so that the temp files remained in /tmp/.



It looks like tesseract doesn't like that.

I tried to ocr these .tmp files to see if I could isolate what was going
wrong for me.



kslote@ubuntu:~/tika/tika$ tesseract
/tmp/apache-tika-7112319184053570698.tmp out

Tesseract Open Source OCR Engine

name_to_image_type:Error:Unrecognized image
type:/tmp/apache-tika-7112319184053570698.tmp

IMAGE::read_header:Error:Can't read this image
type:/tmp/apache-tika-7112319184053570698.tmp

tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp

Segmentation fault



On the wiki it mentions something about getting tesseract to work with
.tiff files.  For whatever reason, the tesseract I have installed only
works for .tiff files.  Would it be recommended that I reinstall tesseract
from source?

On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
paul.m.rami...@jpl.nasa.gov> wrote:

> Is that a typo in your path to tesseract?
>
> /urs/bin/tesseract => /usr/bin/tesseract
>
> --Paul
>
> > On Sep 30, 2014, at 1:48 PM, "kevin slote"  wrote:
> >
> > Unfortunately, that did not do it either.
> >
> > I did:
> >
> >   $export
> >
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >
> > Here is the output from printenv
> >
> > kslote@ubuntu:~/tika/tika$ printenv
> > SHELL=/bin/bash
> > USERNAME=kslote
> > XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> > DESKTOP_SESSION=gnome
> >
> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> > PWD=/home/kslote/tika/tika
> > HOME=/home/kslote
> > LOGNAME=kslote
> > _=/usr/bin/printenv
> >
> >
> > On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich 
> > wrote:
> >
> >> Hi,
> >>
> >> Hmm. Could you try adding tesseract to your PATH? How did you install
> >> Tesseract? You should be able to do a straightforward `sudo apt-get
> install
> >> tesseract-ocr`. After that, the OCR tests should pass. We're still
> running
> >> into TIKA-1422, where a mail test fails. But, you can run just the OCR
> >> tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
> >> -DfailIfNoTests=false`.
> >>
> >> Let me know if that works for you!
> >> Tyler
> >>
> >>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote 
> wrote:
> >>>
> >>> I am working on Ubuntu 10.04 and I am having some trouble.
> >>> Tesseract is installed correctly, but just doing a clone from the repo
> >> and
> >>> installing with maven, I am getting some errors.
> >>>
> >>> This is before I did anything with tesseract installed.
> >>>
> >>> Failed tests:
>  testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
> >>> Check for the image's text.
> >>>  testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>  testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>
> >>> Next I hard coded the tesseractPath:
> >>>
> >>> I went into the TesseractOCRConfig.java and hard coded 'tesseractPath.'
> >>> All the tests passed and it built successfully, but then I went to post
> >>> some tiff's to the server.
> >>> That didn't work. So I tried adding some System.out.println("hello
> >> world")
> >>> (a little crude I know) inside the unit tests to confirm that tesseract
> >>> was working correctly.  It looks like something happens in the unit
> test
> >> in
> >>> TesseractOCRTest

Re: OCR with tika-server

2014-09-30 Thread kevin slote
Unfortunately, that did not do it either.

I did:

   $export
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract

Here is the output from printenv

kslote@ubuntu:~/tika/tika$ printenv
SHELL=/bin/bash
USERNAME=kslote
XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
DESKTOP_SESSION=gnome
PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
PWD=/home/kslote/tika/tika
HOME=/home/kslote
LOGNAME=kslote
_=/usr/bin/printenv


On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich 
wrote:

> Hi,
>
> Hmm. Could you try adding tesseract to your PATH? How did you install
> Tesseract? You should be able to do a straightforward `sudo apt-get install
> tesseract-ocr`. After that, the OCR tests should pass. We're still running
> into TIKA-1422, where a mail test fails. But, you can run just the OCR
> tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
> -DfailIfNoTests=false`.
>
> Let me know if that works for you!
> Tyler
>
> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote  wrote:
>
> > I am working on Ubuntu 10.04 and I am having some trouble.
> > Tesseract is installed correctly, but just doing a clone from the repo
> and
> > installing with maven, I am getting some errors.
> >
> > This is before I did anything with tesseract installed.
> >
> > Failed tests:   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
> > Check for the image's text.
> >   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >
> > Next I hard coded the tesseractPath:
> >
> > I went into the TesseractOCRConfig.java and hard coded 'tesseractPath.'
> > All the tests passed and it built successfully, but then I went to post
> > some tiff's to the server.
> > That didn't work. So I tried adding some System.out.println("hello
> world")
> >  (a little crude I know) inside the unit tests to confirm that tesseract
> > was working correctly.  It looks like something happens in the unit test
> in
> > TesseractOCRTest.java
> > on the line that says TesseractOCRConfig config = new
> > TesseractOCRConfig();. Printing to stdout before works, but I get nothing
> > after. That happens before the assumeTrue(canRun(config));. So an
> exception
> > is not being raised.
> >
> > Then once everything is built, ocr does not work.  That was why I
> figured I
> > would ask to see if I missed some sort of configuration step in building
> > it.
> >
> > Thanks a ton.
> >
> >
> >
> >
> >
> > On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
> > chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> > > Dear Kevin,
> > >
> > > Sure, it already works :) 1.7-SNAPSHOT.
> > >
> > > See this wiki page:
> > >
> > > https://wiki.apache.org/tika/TikaOCR
> > >
> > > I'd be happy to discuss more.
> > >
> > > Thanks!
> > >
> > > Cheers,
> > > Chris
> > >
> > > ++
> > > Chris Mattmann, Ph.D.
> > > Chief Architect
> > > Instrument Software and Science Data Systems Section (398)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 168-519, Mailstop: 168-527
> > > Email: chris.a.mattm...@nasa.gov
> > > WWW:  http://sunset.usc.edu/~mattmann/
> > > ++
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > ++
> > >
> > >
> > >
> > >
> > >
> > >
> > > -Original Message-
> > > From: kevin slote 
> > > Reply-To: "dev@tika.apache.org" 
> > > Date: Tuesday, September 30, 2014 at 8:52 AM
> > > To: "dev@tika.apache.org" 
> > > Subject: OCR with tika-server
> > >
> > > >Hello all,
> > > >
> > > >I have been testing out the integration of tika with tesseract.
> > > >I was wondering if there is  a way to get tika-server to run with
> > > >tesseract's OCR capabilities?
> > > >
> > > >Best
> > > >
> > > >Kevin Slote
> > >
> > >
> >
>


Re: OCR with tika-server

2014-09-30 Thread kevin slote
I am working on Ubuntu 10.04 and I am having some trouble.
Tesseract is installed correctly, but just doing a clone from the repo and
installing with maven, I am getting some errors.

This is before I did anything with tesseract installed.

Failed tests:   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
Check for the image's text.
  testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
  testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)

Next I hard coded the tesseractPath:

I went into the TesseractOCRConfig.java and hard coded 'tesseractPath.'
All the tests passed and it built successfully, but then I went to post
some tiff's to the server.
That didn't work. So I tried adding some System.out.println("hello world")
 (a little crude I know) inside the unit tests to confirm that tesseract
was working correctly.  It looks like something happens in the unit test in
TesseractOCRTest.java
on the line that says TesseractOCRConfig config = new
TesseractOCRConfig();. Printing to stdout before works, but I get nothing
after. That happens before the assumeTrue(canRun(config));. So an exception
is not being raised.

Then once everything is built, ocr does not work.  That was why I figured I
would ask to see if I missed some sort of configuration step in building it.

Thanks a ton.





On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Dear Kevin,
>
> Sure, it already works :) 1.7-SNAPSHOT.
>
> See this wiki page:
>
> https://wiki.apache.org/tika/TikaOCR
>
> I'd be happy to discuss more.
>
> Thanks!
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: kevin slote 
> Reply-To: "dev@tika.apache.org" 
> Date: Tuesday, September 30, 2014 at 8:52 AM
> To: "dev@tika.apache.org" 
> Subject: OCR with tika-server
>
> >Hello all,
> >
> >I have been testing out the integration of tika with tesseract.
> >I was wondering if there is  a way to get tika-server to run with
> >tesseract's OCR capabilities?
> >
> >Best
> >
> >Kevin Slote
>
>


OCR with tika-server

2014-09-30 Thread kevin slote
Hello all,

I have been testing out the integration of tika with tesseract.
I was wondering if there is  a way to get tika-server to run with
tesseract's OCR capabilities?

Best

Kevin Slote


Re: Hi all,

2014-09-20 Thread kevin slote
Hello,

Are the variables contained in the .img files in the metadata fields?  It
sounds like those Apache projects you mentioned are exactly what you need.

Best

On Sat, Sep 20, 2014 at 8:06 AM, Michael Wechner 
wrote:

> Hi
>
> I would say that Tika is the right framework to start with in order to
> extract data from
> your images.
>
> What exactly do you mean with data analysis?
>
> What is your current process?
>
> Greeting from Zurich, Switzerland :-)
>
> Michael
>
> On 20.09.14 10:55, Antonio Gracia Berná wrote:
> > Hi,
> >
> > My name is Antonio Gracia, I'm a postdoctoral researcher at the Physics
> > Institute, Space Research and Planetary Sciences (Bern, Switzerland).
> We're
> > currently working as data analyst in 3 ESA (European Space Agency)
> > projects: BepiColombo, ExoMars and Rosetta. Thus, we're dealing with
> > massive amounts of data collected from spacecrafts (.img files). These
> data
> > usually contain pictures, variables about physical properties of
> celestial
> > bodies and spacecrafts, as well as other metadata.
> >
> > I've been recently in contact with Dr. Chris Mattmann (by
> recommendation),
> > and this is why I've decided to introduce myself here, to ask you several
> > issues.
> >
> > From your point of view, which of the different Apache technologies
> (e.g.,
> > Apache Tika, Apache Nutch, Apache Lucene and Solr, Apache Gora, and
> Apache
> > OODT) do you think would be useful to make the data analysis and
> > processing easier? And if possible, why?
> >
> > Thank you very much and nice to meet you all,
> > Kind regards
> > AGB
> >
> >
> >
> >
>
>


Re: [jira] [Commented] (TIKA-93) OCR support

2014-08-21 Thread kevin slote
Is Tesseract in the trunk?  If so where can I find it?  Also, Petr, would
you mind posting your tika-config.xml?


On Wed, Aug 20, 2014 at 3:36 AM, Petr Vas (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103563#comment-14103563
> ]
>
> Petr Vas commented on TIKA-93:
> --
>
> No problem )
>
> > OCR support
> > ---
> >
> > Key: TIKA-93
> > URL: https://issues.apache.org/jira/browse/TIKA-93
> > Project: Tika
> >  Issue Type: New Feature
> >  Components: parser
> >Reporter: Jukka Zitting
> >Assignee: Chris A. Mattmann
> >Priority: Minor
> > Fix For: 1.7
> >
> > Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx,
> testOCR.pdf, testOCR.pptx
> >
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> there are command line OCR tools like Tesseract (
> http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


Re: [jira] [Comment Edited] (TIKA-93) OCR support

2014-08-12 Thread kevin slote
Will the tesseract support be for unix as well as windows?


On Tue, Aug 12, 2014 at 8:01 AM, Petr Vas (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093987#comment-14093987
> ]
>
> Petr Vas edited comment on TIKA-93 at 8/12/14 12:00 PM:
> 
>
> [~chrismattmann], do you know when we can expect this OCR parser to appear
> in Tika's SVN repo?
>
>
> was (Author: yonyonson):
> [~chrismattmann], do you know when we can expect this OCR parser to appear
> in released version (i.e. is there any expected release date for Tika 1.7)?
> Would there be any RC / beta version that can be used?
>
> I can see that previous versions of Tika used to be released each half
> year or so and it puts 1.7 release date somewhere in Feb 2015. Does it
> sounds right?
>
> > OCR support
> > ---
> >
> > Key: TIKA-93
> > URL: https://issues.apache.org/jira/browse/TIKA-93
> > Project: Tika
> >  Issue Type: New Feature
> >  Components: parser
> >Reporter: Jukka Zitting
> >Assignee: Chris A. Mattmann
> >Priority: Minor
> > Fix For: 1.7
> >
> > Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx,
> testOCR.pdf, testOCR.pptx
> >
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> there are command line OCR tools like Tesseract (
> http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


Re: Compress algorithm 'implode' not parsed.

2014-08-04 Thread kevin slote
Sorry I haven't responded to this.  I tried updating the pom to have the
latest version of compress and it didn't change anything.  I tested this on
1.5.  I just cloned the git repository and will test this again on the
latest version of the code.


On Thu, Jul 31, 2014 at 12:26 PM, Nick Burch  wrote:

> On Thu, 31 Jul 2014, kevin slote wrote:
>
>> Point being, Tika 1.5 uses apache-commons-compress 1.5. According to the
>> Apache compress jira ticket below, Apache compress can
>>
>
> Trunk currently uses Commons Compress 1.8, can you try with that?
>
> (Tika 1.6 should be out within about a week, based on trunk)
>
> Nick
>


Compress algorithm 'implode' not parsed.

2014-07-31 Thread kevin slote
Hi everyone.

I would like to talk about a compression algorithm that doesn't get parsed
by Tika yet, but could be.  The compression algorithm is called 'implode',
and there is a patch for Apache Commons Compress that can handle it which is
not yet leveraged by Tika.

There is a unit test in Tika in ZipParserTest.java:

It is a test to demonstrate that just the names of files get extracted when
the zip is compressed with the 'implode' compression algorithm.
The file moby.zip in the test data is compressed with this type of
compression.



    /**
     * Test case for the ability of the ZIP parser to extract the name of
     * a ZIP entry even if the content of the entry is unreadable due to an
     * unsupported compression method.
     *
     * @see <a href="https://issues.apache.org/jira/browse/TIKA-346">TIKA-346</a>
     */
    @Test
    public void testUnsupportedZipCompressionMethod() throws Exception {
        String content = new Tika().parseToString(
                ZipParserTest.class.getResourceAsStream(
                        "/test-documents/moby.zip"));
        assertTrue(content.contains("README"));
    }


The implode compression algorithm is an old proprietary compression
algorithm that used to be used by PKZIP in the '80s.

It uses Shannon–Fano coding, which has fallen out of favor since Huffman
coding is more efficient.

Point being, Tika 1.5 uses apache-commons-compress 1.5.  According to the
Apache Compress JIRA ticket below, Commons Compress can handle this
compression method from version 1.7 onwards.  I was wondering whether, if I
wrote a patch for this, I could contribute it to Tika, or whether this is
worth opening as an issue.
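
For reference, a rough sketch of what reading the imploded entries through
Commons Compress 1.7+'s ZipFile could look like -- the method names are from
the Commons Compress API (org.apache.commons.compress.archivers.zip), the
surrounding parser plumbing is left out, and none of this is existing Tika
code:

    ZipFile zip = new ZipFile(new File("moby.zip"));
    try {
        Enumeration<ZipArchiveEntry> entries = zip.getEntries();
        while (entries.hasMoreElements()) {
            ZipArchiveEntry entry = entries.nextElement();
            // canReadEntryData() is false for methods the library cannot
            // decode, so unsupported entries still yield at least their name
            if (zip.canReadEntryData(entry)) {
                InputStream content = zip.getInputStream(entry);
                // ... hand the stream to the embedded document handler ...
                content.close();
            }
        }
    } finally {
        zip.close();
    }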




https://issues.apache.org/jira/browse/COMPRESS-115

http://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding



Re: Expected output

2014-06-28 Thread kevin slote
Possibly nothing.  That was part of my question.  I was asking if data like
this was to be expected.  99% of the time, tika-server returns data that is
formatted more like standard CSV output.  I have only ever seen metadata
returned like this once before.  Usually, metadata is just data about the
data.  This looks more like just data to me.  That was why I thought I would
ask.


On Sat, Jun 28, 2014 at 4:41 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Kevin,
>
> On Fri, Jun 27, 2014 at 7:56 AM,  wrote:
>
> >
> > Subject: Expected output
> > Hello everyone.  I have a question about the expected output for tika.  I
> > am working on integrating my python application with tika-server.  One of
> > the test files for unit test produces this for the metadata.  The test
> file
> > is test.he5,
> > and the way I call tika is just to send this file to
> > http://localhost:9998/meta while tika-server-1.5 is running.
> >
> > Should I expect csv formatted data that occasionally has long strings of
> > text with many line breaks?
> >
> >
> > I am unsure to this question...
> What is wrong?
>


Expected output

2014-06-27 Thread kevin slote
Hello everyone.  I have a question about the expected output for tika.  I
am working on integrating my python application with tika-server.  One of
the test files for my unit tests produces this for the metadata.  The test
file is test.he5, and the way I call tika is just to send this file to
http://localhost:9998/meta while tika-server-1.5 is running.
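
For reference, the request itself is nothing more exotic than an HTTP PUT of
the raw file body -- roughly like this (a sketch; it assumes the 1.5 server
is on the default port and accepts a PUT at /meta as described above):

    HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9998/meta").openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    OutputStream out = conn.getOutputStream();
    Files.copy(Paths.get("test.he5"), out);   // stream the file as the request body
    out.close();
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
    for (String line; (line = reader.readLine()) != null; ) {
        System.out.println(line);             // the CSV-style metadata shown below
    }
    reader.close();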

Should I expect csv formatted data that occasionally has long strings of
text with many line breaks?


"StartUTC","2009-05-02T00:00:00.00Z"
"InstrumentName","MLS Aura"
"LastMAF","6114076"
"ProcessLevel","L2"
"GranuleYear","2009"
"OrbitNumber","25509"
"GranuleDayOfYear","122"
"VerticalCoordinate","Pressure","Pressure","Pressure"
"HostName"," "
"EndUTC","2009-05-02T23:59:59.99Z"
"GranuleDay","2"
"cdm_data_type","PROFILE"
"PCF1","#
# filename:
# PCF.relB0
#
# description:
#   Process Control File (PCF)
#
# notes:
#
# This file supports the Release B version of the toolkit.
#   It is intended for use with toolkit version ""TK_VERSION_STRING"".
#
#   The logical IDs 1-10999 (inclusive) are reserved for internal
#   Toolkit/ECS usage, DO NOT add logical IDs with these values.
#
# Please treat this file as a master template and make copies of it
# for your own testing. Note that the Toolkit installation script
#   sets PGS_PC_INFO_FILE to point to this master file by default.
#   Remember to reset the environment variable PGS_PC_INFO_FILE to
#   point to the instance of your PCF.
#
#   The toolkit will not interpret environment variables specified
#   in this file (e.g. ~/database/$OSTYPE/TD is not a valid reference).
#   The '~' character, however, when appearing in a reference WILL be
#   replaced with the value of the environment variable PGSHOME.
#
#   The PCF file delivered with the toolkit should be taken as a
#   template.  User entries should be added as necessary to this
#   template.  Existing entries may (in some cases should) be altered
#   but generally should not be commented out or deleted.  A few
#   entries may not be needed by all users and can in some cases
#   be commented out or deleted.  Such entries should be clearly
#   identified in the comment(s) preceding the entry/entries.
#
#   Entries preceded by the comment: (DO NOT REMOVE THIS ENTRY)
#   are deemed especially critical and should not be removed for
#   any reason (although the values of the various fields of such an
#   entry may be configurable).
#
# ---
?   SYSTEM RUNTIME PARAMETERS
# ---
#
#
# This section contains unique identifiers used to track instances of
# a PGE run, versions of science software, etc.  This section must
# contain exactly two entries.  These values will be inserted by
# ECS just before a PGE is executed.  At the SCF the values may be set
# to anything but these values are not normally user definable and user
# values will be ignored/overwritten at the DAAC.
#
#
#
# Production Run ID - unique production instance identifier
# (DO NOT REMOVE THIS ENTRY)
# ---
1
# ---
# Software ID - unique software configuration identifier
# (DO NOT REMOVE THIS ENTRY)
# ---
1
#
?   PRODUCT INPUT FILES
#
#
# This section is intended for standard product inputs, i.e., major
# input files such as Level 0 data files.
#
# Each logical ID may have several file instances, as given by the
# version number in the last field.
#
#
#
# Next non-comment line is the default location for PRODUCT INPUT FILES
# WARNING! DO NOT MODIFY THIS LINE unless you have relocated these
# data set files to the location specified by the new setting.
!  /workops/jobs/science/1241373300.02916
#
#-
# Test input files
#-
900|job.PCF|1
901|l2cf.0223|/science1
2|emls-signals.dat|/science1
20001|MLS-Aura_L2Cal-AAAP_v2-0-0_d000.txt|/science/l2cal1
20002|MLS-Aura_L2Cal-Filters_v1-5-0_d000.txt|/science/l2cal1
20003|MLS-Aura_L2Cal-DACSFilters_v1-5-1_d000.txt|/science/l2cal1
20004|MLS-Aura_L2Cal-PFG_v2-0-4_d000.txt|/science/l2cal1
20005|PFAData_R1A_v2-0-5.h5|/science/l2cal1
20006|PFAData_R1B_v2-0-5.h5|/science/l2cal1
20007|PFAData_R2_v2-0-5.h5|/science/l2cal1
20008|PFAData_R3_v2-0-5.h5|/science/l2cal1
20009|PFAData_R4_v2-0-6.h5|/science/l2cal1
20010|PFAData_R5H_v2-0-5.h5|/science/l2cal1
20011|PFAData_R5V_v2-0-5.h5|/s

Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread kevin slote
Do you already have this issue fixed?  I encountered something similar to
this and already worked it out.


On Mon, Jun 23, 2014 at 1:03 PM, Nick Burch  wrote:

> On Mon, 23 Jun 2014, kevin slote wrote:
>
>> What tika version will have the pst support?
>>
>
> See TIKA-623 - PST support is already in trunk, and will be included in
> Tika 1.6 when that gets released
>
> Nick
>


Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread kevin slote
What tika version will have the pst support?


On Mon, Jun 23, 2014 at 4:23 AM, Hong-Thai Nguyen (JIRA) 
wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519
> ]
>
> Hong-Thai Nguyen commented on TIKA-1350:
> 
>
> Richard Johnson (author of java-libpst) is trying to deploy new version 0.8.1
> to Maven Central (ref.
> https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254
> )
>
> When this work is done, we can upgrade the Tika dependency to 0.8.1 to get the fix.
>
> > OutlookPSTParser: Unknown message type: IPM.Note
> > 
> >
> > Key: TIKA-1350
> > URL: https://issues.apache.org/jira/browse/TIKA-1350
> > Project: Tika
> >  Issue Type: Bug
> >  Components: parser
> >Affects Versions: 1.7
> >Reporter: Jonathan Evans
> >  Labels: libpst, parser, pst
> > Fix For: 1.7
> >
> >   Original Estimate: 0.2h
> >  Remaining Estimate: 0.2h
> >
> > When parsing some emails in a PST file I get the error "Unknown message
> type: IPM.Note" preventing them from being parsed. This is because of an
> extra null byte at the end of the message class string.
> > This has been fixed in version 0.8.1 of java-libpst so a version bump is
> all that is required.
> > https://github.com/rjohnsondev/java-libpst/issues/14
> > I would attempt to do this myself but I am unsure how to open a pull
> request with SVN.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


tika-server returning ill formed xml in 1.3

2014-04-10 Thread kevin slote
Hey everyone, I am using tika-server-1.3 to extract metadata from an image
and instead of the regular comma-separated values it returns large blobs of
XML inside the CSV-style data.

Here is what the returned data looks like:

"Compression Lossless","true"
"Text TextEntry","keyword=XML:com.adobe.xmp, value=
   http://www.w3.org/1999/02/22-rdf-syntax-ns#"";>
  http://purl.org/dc/elements/1.1/"";>
 

 
  
   
, language=, compression=none"
"Chroma BlackIsZero","true"



Is this an expected output or a bug?  I have never seen the metadata look
like this.


Re: problem with embedded OLE attachments

2013-10-17 Thread kevin slote
Thanks,  that works!


On Thu, Oct 17, 2013 at 9:58 AM, Nick Burch  wrote:

> On Thu, 17 Oct 2013, kevin slote wrote:
>
>> I do.  But, when I deleted the jars from my classpath, I got the same
>> error.
>>
>
> You haven't got them all then. See the POI FAQ
> <http://poi.apache.org/faq.html#faq-N10006>
> for how to check what jar you're really using
>
>
>  Additionally, is there a work around if I needed to have poi in my
>> program?
>>
>
> Either use a custom classloader, or better yet just use the latest version
> of Tika which uses the latest version of POI, and have everything run
> against the most recent POI release
>
> Nick
>
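
(For anyone hitting the same IllegalAccessError: the check the POI FAQ points
at boils down to asking the JVM where the conflicting class was actually
loaded from, roughly:

    // prints the jar that org.apache.poi.POIDocument was loaded from
    System.out.println(org.apache.poi.POIDocument.class
            .getProtectionDomain().getCodeSource().getLocation());

If that prints an old or unexpected POI jar, that jar is the one to remove.)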


Re: problem with embedded OLE attachments

2013-10-17 Thread kevin slote
Ok, I forgot to get rid of poi scratchpad and it worked without the errors.
 Is there a workaround if I need to use poi elsewhere in my code?


On Thu, Oct 17, 2013 at 9:53 AM, kevin slote  wrote:

> I do.  But, when I deleted the jars from my classpath, I got the same
> error.
> Additionally, is there a work around if I needed to have poi in my program?
>
>
> On Thu, Oct 17, 2013 at 8:25 AM, Nick Burch  wrote:
>
>> On Thu, 17 Oct 2013, kevin slote wrote:
>>
>>> Hi, I was trying to parse a word file with an embedded OLE attachment
>>> and I got this error...
>>>
>>> Caused by: java.lang.IllegalAccessError: tried to access method
>>> org.apache.poi.POIDocument.<init>(Lorg/apache/poi/poifs/
>>> filesystem/DirectoryNode;)V
>>> from class org.apache.tika.parser.microsoft.WordExtractor
>>>
>>
>> Most likely you have mis-matched jars on your classpath. You need to
>> ensure you have the same copy of POI that came with the version of Tika
>> that you're using, and don't have any other ones. I suspect you either are
>> using a different set of POI jars, or have two sets
>>
>> Nick
>>
>
>


Re: problem with embedded OLE attachments

2013-10-17 Thread kevin slote
I do.  But, when I deleted the jars from my classpath, I got the same error.
Additionally, is there a work around if I needed to have poi in my program?


On Thu, Oct 17, 2013 at 8:25 AM, Nick Burch  wrote:

> On Thu, 17 Oct 2013, kevin slote wrote:
>
>> Hi, I was trying to parse a word file with an embedded OLE attachment and
>> I got this error...
>>
>> Caused by: java.lang.IllegalAccessError: tried to access method
>> org.apache.poi.POIDocument.<init>(Lorg/apache/poi/poifs/
>> filesystem/DirectoryNode;)V
>> from class org.apache.tika.parser.microsoft.WordExtractor
>>
>
> Most likely you have mis-matched jars on your classpath. You need to
> ensure you have the same copy of POI that came with the version of Tika
> that you're using, and don't have any other ones. I suspect you either are
> using a different set of POI jars, or have two sets
>
> Nick
>


problem with embedded OLE attachments

2013-10-17 Thread kevin slote
Hi, I was trying to parse a word file with an embedded OLE attachment and I
got this error...


Caused by: java.lang.IllegalAccessError: tried to access method
org.apache.poi.POIDocument.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V
from class org.apache.tika.parser.microsoft.WordExtractor


I dug into the source code and noticed that most of the methods in
microsoft.WordExtractor are set to private or protected.

I was wondering if the folks on this list thought that could be the source
of the error.  It would make sense to me since the IllegalAccessError gets
thrown when two instances of the same class have been loaded by the class
loader in the JVM and the classes have been irreconcilably changed.  The
usual workaround for this error is to set the methods in the class that is
causing the error to public.  But what does everyone else think?


Re: Excel files with "holes" in the cell sequence

2013-10-08 Thread kevin slote
The last time I parsed spreadsheets with POI, I found a lot of
functionality to render the layout of the spreadsheet in CSS.  Does anyone
think that would be a worthwhile or feasible endeavor?  I would love to
become a committer to tika.


On Tue, Oct 8, 2013 at 9:14 AM, Nick Burch  wrote:

> Hi All
>
> The Excel file formats (.xls and .xlsx) are somewhat sparse formats, and
> where a cell has never been used it generally doesn't get written to the
> file. (Being a Microsoft format, there are exceptions to this...).
> Currently, if you parse a file with cells at A1 B1 F1 G1, then Tika will
> give you back a table with just 4 columns in, squashing the gaps.
>
> Within POI, there is optional logic to detect these gaps, and generate
> dummy cells to let you know that something was missed. So, if we wanted,
> with not too much work we could detect and handle these
>
> However, I'm not sure if that's something we should be doing or not? What
> do people think - should we be doing that level of processing before
> generating the SAX events, or would that be a step too far?
>
> Nick
>
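
For context, the usermodel-side handling Nick refers to looks roughly like
this -- a sketch only; "row" is an org.apache.poi.ss.usermodel.Row from the
sheet being converted, and the streaming .xls path would go through POI's
MissingRecordAwareHSSFListener instead:

    // iterate by column index rather than over the cells that happen to
    // exist, so gaps such as C1..E1 show up instead of being squashed
    for (int c = 0; c < row.getLastCellNum(); c++) {
        Cell cell = row.getCell(c);   // null when the cell was never written
        if (cell == null) {
            // emit an empty placeholder cell for the gap
        } else {
            // emit the cell content as is done today
        }
    }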


Re: problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread kevin slote
Well, there was no error during runtime; it was just that the data was
erased.  After debugging it with a print statement,
System.out.println(in.read());,  I discovered that the InputStream was
being erased after I called the detect(InputStream in) method.
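
A minimal illustration of the behaviour, and of the BufferedInputStream
wrapping Jukka suggests below (sketch only; "attachment" is a java-libpst
PSTAttachment and "tika" an org.apache.tika.Tika instance, as in the code
later in this thread):

    InputStream raw = attachment.getFileInputStream();   // no mark() support
    String type = tika.detect(raw);
    System.out.println(raw.read());                       // prints -1: the data is gone

    InputStream buffered = new BufferedInputStream(attachment.getFileInputStream());
    String type2 = tika.detect(buffered);                 // detect() can mark/reset this one
    System.out.println(buffered.read());                  // first byte of the attachment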


On Mon, Sep 30, 2013 at 11:19 AM, Sergey Beryozkin wrote:

> Hi
>
> On 30/09/13 15:49, kevin slote wrote:
>
>> Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
>> very much.  Is this the forum where I could bring up an issue I found with
>> the Tika-JAX-RS server?
>>
>>  What kind of issue are you seeing ?
>
> Sergey
>
>
>> On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting 
>> wrote:
>>
>>  Hi,
>>>
>>> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote  wrote:
>>>
>>>> InputStream in = attachment.getFileInputStream();
>>>> [...]
>>>> String mime = tika.detect(in);
>>>>
>>>
>>> See the javadocs [1]: "If the document stream supports the mark
>>> feature, then the stream is marked and reset to the original position
>>> before this method returns"
>>>
>>> I believe the stream you're using does not support the mark feature
>>> (see [2]), which makes it impossible for Tika to restore the original
>>> state of the stream once type detection is done.
>>>
>>> Using BufferedInputStream [3] should fix your problem:
>>>
>>>  InputStream in = new
>>> BufferedInputStream(attachment.getFileInputStream());
>>>
>>> [1]
>>> http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
>>> [2]
>>> http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
>>> [3]
>>> http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>>>
>>> BR,
>>>
>>> Jukka Zitting
>>>
>>>
>>
>


Re: problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread kevin slote
Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
very much.  Is this the forum where I could bring up an issue I found with
the Tika-JAX-RS server?


On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting wrote:

> Hi,
>
> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote  wrote:
> > InputStream in= attachment.getFileInputStream();
> > [...]
> > String mime = tika.detect(in);
>
> See the javadocs [1]: "If the document stream supports the mark
> feature, then the stream is marked and reset to the original position
> before this method returns"
>
> I believe the stream you're using does not support the mark feature
> (see [2]), which makes it impossible for Tika to restore the original
> state of the stream once type detection is done.
>
> Using BufferedInputStream [3] should fix your problem:
>
> InputStream in= new
> BufferedInputStream(attachment.getFileInputStream());
>
> [1]
> http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
> [2]
> http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
> [3]
> http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>
> BR,
>
> Jukka Zitting
>


problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread kevin slote
Hi,  I have been using tika for a while now without any problems and I am a
big fan of the software.  I wanted to do my part and report what I suspect
might be a bug.

My code uses two different libraries, JavaMail and java-libpst, and I am unit
testing with Dumbster.  When I send the email, the last unit test that I
built with Dumbster was to make sure that all of the attachments were
appended correctly, and this failed.  After doing some nitty-gritty
debugging, I discovered that if I positioned a
System.out.println(in.read()); directly before where I was calling tika,
it would yield the correct number on the console.  However, if I used the
same command after where tika was called, it read -1.

public void sendAsEmail(PSTMessage email, String parent, String dir)
throws IOException, MessagingException, PSTException {
String subject = email.getSubject();
String to = primaryRecipientsEmail(email);
String from = email.getSenderEmailAddress();
if (!isValidEmailAddress(from)) {
from = "emptyfromstr...@placeholder.com";
}
Properties props = new Properties();
props.put("mail.transport.protocol", "smtp");
props.put("mail.smtp.host", "localhost");
props.put("mail.smtp.auth", "false");
props.put("mail.debug", "false");
props.put("mail.smtp.port", "3025");//change back to 25

Session session = Session.getDefaultInstance(props);

Transport transport = session.getTransport("smtp");
transport.connect();

Message message = new MimeMessage(session);
message.addHeader("Parent-Info", parent);
message.addHeader("directory", dir);
message.setSubject(subject);
messageBodyPart.setText(email.getBody());
multipart.addBodyPart(messageBodyPart);
message.setFrom(new InternetAddress(from));
message.setRecipients(Message.RecipientType.TO,
InternetAddress
.parse(to));

try {
String transportHeaders = email.getTransportMessageHeaders();
String[] headers = parseTransporHeaders(transportHeaders);
for (String header : headers) {
messageBodyPart.addHeaderLine(header);
multipart.addBodyPart(messageBodyPart);
}
} catch (Exception e) {
log.info("missing chunk is transport headers: " + e);
}
try {
   if(email.hasAttachments()){
 int attachmentIndex = 0;
 while (attachmentIndex < email.getNumberOfAttachments()) {
 PSTAttachment attachment = email.getAttachment(attachmentIndex);
 InputStream in= attachment.getFileInputStream();
 if (attachment.getAttachMethod() !=
PSTAttachment.ATTACHMENT_METHOD_EMBEDDED
   && attachment.getAttachMethod() !=
PSTAttachment.ATTACHMENT_METHOD_OLE) {

  String filename = attachment.getFilename();
 String mime = tika.detect(in); //here is where I
called tika for use in a method that has since been depreciated.
 messageBodyPart = new MimeBodyPart();
 messageBodyPart.attachFile(file);
 messageBodyPart.setFileName(filename);
 multipart.addBodyPart(messageBodyPart);

 } else {
 log.info("not base 64 file: " + attachment.getFilename());
 }
 in.close();
 attachmentIndex++;
 }
}
}catch(Exception e){
log.info("failed attaching file to "+e);
}
 message.setContent(multipart);
transport.sendMessage(message, message.getAllRecipients());
transport.close();
}

Following the advice of Ken Krugler, I thought I would share this on this
list to see if it was an error in my code or an issue in tika.


[jira] [Commented] (TIKA-1164) InputStream get modified by content type detection

2013-09-27 Thread Kevin Slote (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780128#comment-13780128
 ] 

Kevin Slote commented on TIKA-1164:
---

Thanks Ken.  I can do that. Where can I find said user's list? 

> InputStream get modified by content type detection
> --
>
> Key: TIKA-1164
> URL: https://issues.apache.org/jira/browse/TIKA-1164
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Windows 7 / Eclipse Kepler / Tomcat 7 / JavaSE 7
>Reporter: Joël Royer
>Priority: Blocker
>
> I'm using Tika for content type detection after file upload.
> After tika detection, file content is modified (not the same size compared to 
> original uploaded file).
> Here is my code:
> {code}
> AutoDetectParser parser = new AutoDetectParser();
> Detector detector = parser.getDetector();
> Metadata md = new Metadata();
> md.add(Metadata.RESOURCE_NAME_KEY, uploadedFilename);
> md.add(Metadata.CONTENT_TYPE, uploadedFileContentType);
> MediaType type = detector.detect(new BufferedInputStream(is), md);
> {code}
> Before detection, file size is correct.
> After detection, file size is lower than original.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1164) InputStream get modified by content type detection

2013-09-27 Thread Kevin Slote (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780105#comment-13780105
 ] 

Kevin Slote commented on TIKA-1164:
---

I recently joined this page to report this same bug.  I encountered this the 
other day.  If I applied the detect(InputStream in) method.  It erased the 
InputStream.  

> InputStream get modified by content type detection
> --
>
> Key: TIKA-1164
> URL: https://issues.apache.org/jira/browse/TIKA-1164
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Windows 7 / Eclipse Kepler / Tomcat 7 / JavaSE 7
>Reporter: Joël Royer
>Priority: Blocker
>
> I'm using Tika for content type detection after file upload.
> After tika detection, file content is modified (not the same size compared to 
> original uploaded file).
> Here is my code:
> {code}
> AutoDetectParser parser = new AutoDetectParser();
> Detector detector = parser.getDetector();
> Metadata md = new Metadata();
> md.add(Metadata.RESOURCE_NAME_KEY, uploadedFilename);
> md.add(Metadata.CONTENT_TYPE, uploadedFileContentType);
> MediaType type = detector.detect(new BufferedInputStream(is), md);
> {code}
> Before detection, file size is correct.
> After detection, file size is lower than original.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Tika JAX-RS server port bug

2013-09-16 Thread Kevin Slote
Hi.  My name is Kevin Slote and I am looking to report a bug that I think I
found in the Tika JAX-RS server.
I was not really sure where I should report this.  
In the source code for TikaServerCli.java in the svn repository, there is
this block of code

  int port = DEFAULT_PORT;

  if (line.hasOption("port")) {
port = Integer.valueOf(line.getOptionValue("port"));
  }

which lets the user specify the port from the command line.  But a couple of
lines down there is this line that sets the port number.

sf.setAddress("http://localhost:"; + MsgServerClient.DEFAULT_PORT + "/");

I am pretty sure it should read port instead of DEFAULT_PORT.  This explains
a bug that I experienced where I could not get it to run on any port except
port 9998 (the default port).
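
In other words, the fix would presumably just be:

sf.setAddress("http://localhost:" + port + "/");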



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-JAX-RS-server-port-bug-tp4090355.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.