Have you looked at the StandAloneRunner included with that plugin? I would experiment with that first: see whether it can extract the content directly, then whether it can extract it from your base64-encoded version of the document. Once that works, I suspect you will be able to do what you are hoping.
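If you want to rule out Tika itself at the same time, that test is only a few lines of Java. A minimal sketch, assuming the tika-app jar is on your classpath and Java 8 for java.util.Base64 (the file name here is made up):

    import org.apache.tika.Tika;

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.nio.file.Files;
    import java.util.Base64;

    public class TikaSmokeTest {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            File doc = new File("sample.docx"); // hypothetical test document

            // Step 1: can Tika extract the content at all?
            System.out.println(tika.parseToString(doc));

            // Step 2: can it extract after a base64 round trip, which is
            // roughly what the mapper-attachments plugin receives?
            String encoded = Base64.getEncoder()
                    .encodeToString(Files.readAllBytes(doc.toPath()));
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(tika.parseToString(new ByteArrayInputStream(decoded)));
        }
    }

If step 1 prints text but the plugin still gives you null, the problem is most likely in how your base64 string is being produced or embedded in the JSON, not in Tika.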
However, while this plugin aims to make indexing documents easier, it does not make it more efficient. You have mentioned many times that you have a large number of documents to process, and it sounds like you believe that by keeping the contents of the documents out of the JSON you are being more efficient. Instead you have opted to put each entire document, base64 encoded, into the JSON, which is far less efficient: base64 encoding inflates a document by roughly a third, and depending on the format, the encoded file can dwarf the actual text it contains.

If you instead use Tika to extract the text yourself, not via the plugin, put that text into the JSON, then gzip and POST the JSON, that is the optimal way to submit your documents for indexing. It also gives you the greatest level of control, and it allows you to use the bulk API.

One note: ElasticSearch has a maximum HTTP POST size by default, http.max_content_length (see http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-http.html). If you are posting large documents you may exceed it, especially if you are using the bulk API.
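To make that concrete, here is a rough sketch of such a loader in Java. The directory, index, type, and field names are placeholders of mine; the hand-rolled JSON escaping is only illustrative (use a proper JSON library for real data); and the gzipped POST assumes the node has http.compression enabled, which I believe is off by default:

    import org.apache.tika.Tika;

    import java.io.File;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class BulkLoader {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            File[] files = new File("/data/docs").listFiles(); // hypothetical source directory
            if (files == null) return;

            // Extract the text ourselves and build one bulk request body:
            // an action line plus a document line per file.
            StringBuilder bulk = new StringBuilder();
            int id = 0;
            for (File doc : files) {
                String text = tika.parseToString(doc);
                bulk.append("{\"index\":{\"_index\":\"docs\",\"_type\":\"doc\",\"_id\":\"")
                        .append(id++).append("\"}}\n");
                bulk.append("{\"file\":\"").append(escape(doc.getName()))
                        .append("\",\"content\":\"").append(escape(text)).append("\"}\n");
            }
            post("http://localhost:9200/_bulk", bulk.toString());
        }

        // Minimal JSON string escaping, for illustration only.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"")
                    .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
        }

        // POST the body gzipped; requires http.compression on the server side.
        private static void post(String endpoint, String body) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Encoding", "gzip");
            try (OutputStream out = new GZIPOutputStream(conn.getOutputStream())) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

In practice you would flush the bulk buffer every few hundred documents rather than accumulating all of them, which is also how you keep each request under http.max_content_length.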
If your concern is that you need to use PHP, then you do have an issue. This should be written in Java to fully leverage Tika, and writing it in Java will also allow you to leverage the node client API for writing to ElasticSearch. All of this will make your loading far more efficient than trying to stay in PHP. If PHP is the only language you know, it might be time to learn another; you should not be afraid to, and you might find it easier than what you have been doing so far.

If I had a hard requirement to do this in PHP, after objecting to the requirement and explaining why it is the wrong way to do it, I would pursue alternatives to Tika that work in PHP. There are extractors for the doc format, docx is just XML inside a zip file so it can be unpacked directly, and there are other options. Worst case, you could call a command-line Tika from PHP to extract the text and then post it with PHP, though this will be slow.

The real point is that in order for ElasticSearch to index your content, you need to show it the content. Putting that content into JSON is not only a good way to do that, it is the way it is done with ElasticSearch; you should stop looking for an alternative. Even the plugin you are using will ultimately put the content into JSON and send it to ElasticSearch. This does not mean that you have to store the full content of each document in ElasticSearch: the mappings on your index can take care of that. It also does not mean that you have to retrieve the full content in your search results: your queries can take care of that if your mappings do not.
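As a sketch of those last two points (index, type, and field names are mine, and this is 1.x-era syntax): you can exclude the content field from _source at mapping time, or keep it and trim it out of each response at query time.

    curl -XPUT 'http://localhost:9200/docs' -d '{
      "mappings": {
        "doc": {
          "_source": { "excludes": ["content"] },
          "properties": {
            "file":    { "type": "string", "index": "not_analyzed" },
            "content": { "type": "string" }
          }
        }
      }
    }'

    curl -XGET 'http://localhost:9200/docs/_search' -d '{
      "_source": ["file"],
      "query": { "match": { "content": "quarterly report" } }
    }'

Keep in mind that once content is excluded from _source it is searchable but no longer retrievable or highlightable, so choose based on whether you need snippets in your results.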
On Fri, Mar 13, 2015 at 11:49 AM, Austin Harmon <[email protected]> wrote:

> There is a plugin called mapper attachments: https://github.com/elastic/elasticsearch-mapper-attachments. This plugin is supposed to use Tika to index the content of documents, but it doesn't seem to be working correctly. I base64 encode the documents, but it comes back as null when I decode it.
>
> On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>>
>> Not certain what you are referring to, so I expect not. I have used the ElasticSearch mappings, but I can't see how those would directly integrate with Tika.
>>
>> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]> wrote:
>>
>>> Thank you for the information. This is going to be very difficult, I can tell. Do you have experience with the mapper attachment?
>>>
>>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>>>>
>>>> You're going to have the same issue with Solr: putting the contents into XML, which is even heavier than JSON.
>>>>
>>>> I wish that I had some more experience using Tika; I do not. I am aware of its capabilities but have not had reason to use it myself.
>>>>
>>>> I see what you are saying about others not having the same issue, but what you must realize is that most users are not indexing that type of document. They are indexing events, database records, web pages, and so on. It is a very small subset that indexes things like Word docs and PDFs.
>>>>
>>>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]> wrote:
>>>>
>>>>> Thank you for the information. I've been trying to use the mapper attachment, which has Apache Tika built into it. I am just surprised and confused that so many companies use elasticsearch and yet it is so difficult to index the contents of a document. If I need to index the contents of documents, would it be easier and more efficient to switch over to Apache Solr? As I said, I have 2 TB of data, so it isn't efficient for me to manually input each document so it can be indexed with specific JSON. If you have any experience with Solr, please let me know if it would be a good solution to my problem.
>>>>>
>>>>> thanks,
>>>>> Austin
>>>>>
>>>>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>>>>>>
>>>>>> Take a look at Apache Tika, http://tika.apache.org/. It will allow you to extract the contents of the documents for indexing; this is outside the scope of the ElasticSearch indexing. A good tool to make these files downloadable is also out of scope, but I'll answer what is in scope. You need to put the files somewhere they can be accessed by a URL. Any webserver is capable of this; of course your needs may vary, but this isn't the list for those questions. Once you have a URL that the document can be accessed by, include it in your indexing of the document so that you can point to that URL in your search results.
>>>>>>
>>>>>> I am sure there are other options out there for extracting the contents of Word documents; Apache Tika is one that is frequently used for this purpose, though.
>>>>>>
>>>>>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:
>>>>>>
>>>>>>> Okay, so I have a large amount of data, 2 TB, and it's all Microsoft Office documents and PDFs and emails. What is the best way to go about indexing the body of these documents, i.e. making the contents of each document searchable? I tried to use the PHP client, but that isn't helping, and I know there are ways to convert files in PHP, but is there nothing available that takes in these types of documents? I tried the file_get_contents function in PHP, but it only takes in text documents. Also, would you know of a good tool or method to make the files that are searched downloadable?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Austin
>>>>>>>
>>>>>>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] wrote:
>>>>>>>>
>>>>>>>> Yes, you need to include all the text you want indexed and searchable as part of the JSON.
>>>>>>>>
>>>>>>>> How else would you expect ElasticSearch to receive the data?
>>>>>>>>
>>>>>>>> Regarding large scale production environments, this is why ElasticSearch scales out.
>>>>>>>>
>>>>>>>> Aaron
>>>>>>>>
>>>>>>>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm trying to get an understanding of how to have full-text search on a document and have the body of the document be considered during search. I understand how to do the mapping and use analyzers, but what I don't understand is how they get the body of the document. If your fields are file name, file size, file path, and file type, how do the analyzers get the body of the document? Surely you wouldn't have to put the body of every document into the JSON; that is how I've seen it done in all the examples I've seen, but that doesn't make sense for large-scale production environments. If someone could please give me some insight into how this process works, it would be greatly appreciated.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Austin Harmon
