Re: Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

Thamme Gowda Fri, 03 Mar 2017 20:02:04 -0800

Hi Thejan,

I tried running your code snippet on my machine. It worked!


It looks to me like you have not set up Tesseract, or your setup is
incomplete.
You need both Tesseract and ImageMagick installed and available on your
$PATH for this to work.
You can verify this with the following commands:
$tesseract test.jpg stdout
$convert --help
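
If it helps, the same check can be sketched from Java. This is only a
rough sketch: it assumes a Unix-like system where the executables are named
"tesseract" and "convert", and the class and method names (OcrToolCheck,
isOnPath, missingTools) are made up for illustration; they are not part of
Tika.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: looks for executables on a PATH-style string.
class OcrToolCheck {

    /** Returns true if an executable file named {@code executable} exists in a directory listed in {@code pathValue}. */
    static boolean isOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            File candidate = new File(dir, executable);
            if (candidate.isFile() && candidate.canExecute()) {
                return true;
            }
        }
        return false;
    }

    /** Returns the subset of {@code tools} that cannot be found on {@code pathValue}, in the given order. */
    static List<String> missingTools(String pathValue, String... tools) {
        List<String> missing = new ArrayList<>();
        for (String tool : tools) {
            if (!isOnPath(tool, pathValue)) {
                missing.add(tool);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the OCR setup needs these two command names on PATH.
        List<String> missing = missingTools(System.getenv("PATH"), "tesseract", "convert");
        if (missing.isEmpty()) {
            System.out.println("tesseract and convert were both found on PATH");
        } else {
            for (String tool : missing) {
                System.out.println(tool + " was not found on PATH");
            }
        }
    }
}
```

Finding the files on the path only shows that they are installed; running
the two commands above is still the better test that they actually work.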

I see from the path to your image that you're running it on Linux. If it
is Ubuntu, try installing Tesseract and ImageMagick using apt-get.

There is some documentation on the wiki [1] for setting this up on OS X.
Once you have made these changes, please request edit permission for the
wiki and update the OCR page accordingly. Please keep this on your TODO
list for now :-)

Let me know if this solved your problem.

Best,
TG

[1] https://wiki.apache.org/tika/TikaOCR

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Fri, Mar 3, 2017 at 12:36 PM, Thejan Wijesinghe <
[email protected]> wrote:

> Update: not "getting null for a stream", it should be "getting nothing as
> metadata for the image"
>
> On 4 Mar 2017 01:26, "Thejan Wijesinghe" <[email protected]>
> wrote:
>
>> Hi Thamme,
>>
>> I am happy to say that I have started working on your suggestion for
>> creating a simpler Java version of TesseractOCRParser using Tess4J. I ran
>> a few tests with the existing TesseractOCRParser and found out that I'm
>> getting null for the stream of an image, although I could extract the
>> content of the image without a problem. That particular code snippet is
>> attached below. I'm not sure whether I'm missing something. This is
>> important for me to know because I'm planning to extract metadata as well
>> through the API that I'm going to write using Tess4J.
>>
>> public static void main(final String[] args)
>>         throws IOException, SAXException, TikaException {
>>
>>     // CLI implementation
>>     File imageFile = new File("/home/thejan/Desktop/test.jpg");
>>     ContentHandler handler = new BodyContentHandler();
>>     Metadata metadata = new Metadata();
>>     ParseContext context = new ParseContext();
>>
>>     TesseractOCRParser tessParser = new TesseractOCRParser();
>>     // try-with-resources closes the stream even if parsing throws
>>     try (FileInputStream stream = new FileInputStream(imageFile)) {
>>         tessParser.parse(stream, handler, metadata, context);
>>     }
>>     // The content gets printed correctly
>>     System.out.println(handler.toString());
>>
>>     // But I get "X-Parsed-By : org.apache.tika.parser.EmptyParser" for metadata
>>     String[] metadataNames = metadata.names();
>>
>>     for (String name : metadataNames) {
>>         System.out.println(name + " : " + metadata.get(name));
>>     }
>> }
>>
>>
>> On Fri, Mar 3, 2017 at 6:16 AM, Thamme Gowda <[email protected]>
>> wrote:
>>
>>> Thejan,
>>>
>>> Yes, send your questions to us, and cc dev list.
>>> Looking forward to working with you!
>>>
>>> Best,
>>> TG
>>>
>>> --
>>> Thamme Gowda
>>> TG | @thammegowda
>>> ~Sent via somebody's IMAP server
>>>
>>> On Mar 2, 2017 11:50 AM, "Thejan Wijesinghe" <
>>> [email protected]>
>>> wrote:
>>>
>>> > Dear Thamme and Chris,
>>> >
>>> > I have commented on the particular JIRA page and subscribed to the
>>> > dev mailing list as Thamme suggested. I am really interested in looking
>>> > into the challenges that Thamme has provided. Thank you for guiding me
>>> > this way. If I run into any issues while working on these problems, is
>>> > it alright to contact you this way (mailing the two of you directly
>>> > while CCing the dev list)? Or is there a more suitable way of doing
>>> > that? Pardon me for asking such a question; I just want to follow the
>>> > proper protocol for mailing.
>>> >
>>>
>>
>>
>>
>>
