Re: Analyzers and JSON

Austin Harmon Fri, 13 Mar 2015 10:51:19 -0700

There is a plugin called mapper attachments: 
https://github.com/elastic/elasticsearch-mapper-attachments 
This plugin is supposed to use Tika to index the content of documents, but 
it doesn't seem to be working correctly. I base64-encode the documents, but 
the content comes back as null when I decode it. 
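
For reference, here is a minimal sketch of the encoding step in Python. It assumes the mapping shape shown in the plugin's README (a field of type "attachment" that holds the base64 string). The type name "document" and the field name "file" are made up for illustration, and nothing here talks to a cluster: it only builds the request bodies and checks that the base64 round trip is lossless.

```python
import base64
import json


def build_attachment_mapping(doc_type, field):
    """Return a mapping body declaring `field` as an attachment field."""
    # Shape follows the plugin README: the field type is "attachment".
    # `doc_type` and `field` are hypothetical names chosen by the caller.
    return {doc_type: {"properties": {field: {"type": "attachment"}}}}


def build_attachment_doc(raw_bytes, field):
    """Return an index body with the file bytes base64-encoded under `field`."""
    # b64encode returns bytes; JSON needs str, so decode to ASCII.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {field: encoded}


if __name__ == "__main__":
    # Stand-in bytes; in practice this would be open(path, "rb").read().
    sample = b"%PDF-1.4 stand-in bytes, not a real PDF"
    mapping = build_attachment_mapping("document", "file")
    doc = build_attachment_doc(sample, "file")
    # Decoding the field must give back exactly the bytes that went in.
    assert base64.b64decode(doc["file"]) == sample
    print(json.dumps(mapping))
    print(json.dumps(doc))
```

If that round trip holds on the real files, the encoding step itself is probably not where things go wrong.
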
On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>
> Not certain what you are referring to, so I expect not. I have used the 
> Elasticsearch mappings, but I can't see how those would directly integrate 
> with Tika.
>
> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]> wrote:
>
>> Thank you for the information. This is going to be very difficult, I can 
>> tell. Do you have experience with the mapper attachments plugin?
>>
>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>>>
>>> You're going to have the same issue with Solr, putting the contents into 
>>> XML, which is even heavier than JSON.
>>>
>>> I wish that I had some more experience using Tika; I do not. I am aware 
>>> of its capabilities but have not had reason to use it myself.
>>>
>>> I see what you are saying about others not having the same issue, but 
>>> what you must realize is that most users are not indexing that type of 
>>> document.  They are indexing events, database records, web pages and so 
>>> on.  Only a very small subset index things like Word docs and PDFs.
>>>
>>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]> 
>>> wrote:
>>>
>>>> Thank you for the information. I've been trying to use the mapper 
>>>> attachments plugin, which has Apache Tika built into it. I am just 
>>>> surprised and confused that so many companies use Elasticsearch, yet it 
>>>> is so difficult to index the contents of a document. If I need to index 
>>>> the contents of documents, would it be easier and more efficient to 
>>>> switch over to Apache Solr? As I said, I have 2 TB of data, so it isn't 
>>>> efficient for me to manually input each document so it can be indexed 
>>>> with specific JSON. If you have any experience with Solr, please let me 
>>>> know if it would be a good solution to my problem.
>>>>
>>>> thanks,
>>>> Austin
>>>>
>>>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>>>>>
>>>>> Take a look at Apache Tika (http://tika.apache.org/). It will allow you 
>>>>> to extract the contents of the documents for indexing; this is outside 
>>>>> the scope of Elasticsearch indexing. A good tool to make these files 
>>>>> downloadable is also out of scope, but I'll answer what is in scope. 
>>>>> You need to put the files somewhere they can be accessed by a URL. Any 
>>>>> web server is capable of this. Of course your needs may vary, but this 
>>>>> isn't the list for those questions. Once you have a URL that the 
>>>>> document can be accessed by, include it when you index the document so 
>>>>> that you can point to that URL in your search results.
>>>>>
>>>>> I am sure there are other options out there for extracting the 
>>>>> contents of Word documents; Apache Tika is one that is frequently used 
>>>>> for this purpose, though.
>>>>>
>>>>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Okay, so I have a large amount of data, 2 TB, and it's all Microsoft 
>>>>>> Office documents, PDFs, and emails. What is the best way to go about 
>>>>>> indexing the body of these documents so that their contents are 
>>>>>> searchable? I tried to use the PHP client, but that isn't helping. I 
>>>>>> know there are ways to convert files in PHP, but is there nothing 
>>>>>> available that takes in these types of documents? I tried the 
>>>>>> file_get_contents function in PHP, but it only takes in text documents. 
>>>>>> Also, would you know of a good tool or method to make the files that 
>>>>>> are searched downloadable?
>>>>>>
>>>>>> Thanks,
>>>>>> Austin
>>>>>>
>>>>>>
>>>>>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] 
>>>>>> wrote:
>>>>>>>
>>>>>>> Yes you need to include all the text you want indexed and searchable 
>>>>>>> as part of the JSON.
>>>>>>>
>>>>>>> How else would you expect ElasticSearch to receive the data?
>>>>>>>
>>>>>>> Regarding large scale production environments, this is why 
>>>>>>> ElasticSearch scales out.
>>>>>>>
>>>>>>> Aaron
>>>>>>>
>>>>>>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm trying to get an understanding of how to have full-text search 
>>>>>>>> on documents and have the body of the document be considered during 
>>>>>>>> search. I understand how to do the mapping and use analyzers, but 
>>>>>>>> what I don't understand is how they get the body of the document. If 
>>>>>>>> your fields are file name, file size, file path, and file type, how 
>>>>>>>> do the analyzers get the body of the document? Surely you wouldn't 
>>>>>>>> have to put the body of every document into the JSON. That is how 
>>>>>>>> I've seen it done in all the examples, but that doesn't make sense 
>>>>>>>> for large-scale production environments. If someone could please 
>>>>>>>> give me some insight as to how this process works, it would be 
>>>>>>>> greatly appreciated.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Austin Harmon
>>>>>>>>
>>>>>>>  -- 
>>>>>> You received this message because you are subscribed to a topic in 
>>>>>> the Google Groups "elasticsearch" group.
>>>>>> To unsubscribe from this topic, visit 
>>>>>> https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe.
>>>>>> To unsubscribe from this group and all its topics, send an email to 
>>>>>> [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com.
>>>>>>
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>
>
>


