Take a look at Apache Tika (http://tika.apache.org/). It will let you extract the contents of the documents for indexing; that extraction step is outside the scope of Elasticsearch itself. A good tool for making the files downloadable is also out of scope, but I'll answer what is in scope: you need to put the files somewhere they can be reached by a URL. Any webserver is capable of this; your needs may vary, but this isn't the list for those questions. Once a document is reachable at a URL, include that URL when you index the document so that you can point to it in your search results.
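To make the moving parts concrete, here is a minimal sketch in Java. It uses Tika's parseToString to pull the body text out of a file, then indexes that text together with the file's URL over Elasticsearch's HTTP API. The file path, download URL, index/type names ("docs"/"doc"), and field names (filename, url, content) are all illustrative, not anything your setup requires:

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class IndexDocument {
    public static void main(String[] args) throws Exception {
        // Extract the body text from any supported format (Word, PDF, etc.).
        Tika tika = new Tika();
        File file = new File("/data/docs/report.docx"); // hypothetical path
        String body = tika.parseToString(file);

        // Where the file is served from; any webserver will do.
        String downloadUrl = "http://files.example.com/docs/report.docx";

        // Build the JSON document: extracted text plus the URL to the original.
        String json = String.format(
            "{\"filename\":\"%s\",\"url\":\"%s\",\"content\":%s}",
            file.getName(), downloadUrl, quote(body));

        // Index it over HTTP (index "docs", type "doc", id "1" -- illustrative).
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/docs/doc/1").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Index response: " + conn.getResponseCode());
    }

    // Minimal JSON string escaping for the extracted text; a real JSON
    // library would handle the remaining control characters as well.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                       .replace("\n", "\\n").replace("\r", "\\r")
                       .replace("\t", "\\t") + "\"";
    }
}

In production you would want a proper JSON library and the bulk API rather than one HTTP call per file, but the shape of the indexed document is the point here.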
I'm sure there are other options out there for extracting the contents of Word documents; Apache Tika is simply one that is frequently used for this purpose. A sketch of pulling the URL back out at search time follows the quoted thread below.

On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:

> Okay, so I have a large amount of data (2 TB), all Microsoft Office
> documents, PDFs, and emails. What is the best way to go about indexing
> the body of these documents, i.e. making their contents searchable? I
> tried to use the PHP client, but that isn't helping, and while I know
> there are ways to convert files in PHP, is there nothing available that
> takes in these types of documents? I tried the file_get_contents
> function in PHP, but it only accepts text documents. Also, would you
> know of a good tool or method for making the files that are found
> downloadable?
>
> Thanks,
> Austin
>
> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] wrote:
>>
>> Yes, you need to include all the text you want indexed and searchable
>> as part of the JSON.
>>
>> How else would you expect Elasticsearch to receive the data?
>>
>> Regarding large-scale production environments, this is why
>> Elasticsearch scales out.
>>
>> Aaron
>>
>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon wrote:
>>>
>>> Hello,
>>>
>>> I'm trying to get an understanding of how full-text search works when
>>> the body of the document has to be considered during search. I
>>> understand how to do the mapping and use analyzers, but what I don't
>>> understand is how they get at the body of the document. If your fields
>>> are file name, file size, file path, and file type, how do the
>>> analyzers get the body of the document? Surely you wouldn't have to
>>> put the body of every document into the JSON; that is how it's done in
>>> all the examples I've seen, but it doesn't seem to make sense for
>>> large-scale production environments. If someone could give me some
>>> insight into how this process works, it would be greatly appreciated.
>>>
>>> Thank you,
>>> Austin Harmon
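To close the loop on making files downloadable: because the URL was indexed alongside the extracted content, a full-text query against the content field returns it in each hit's _source, ready to be rendered as a download link in your results page. Same caveats as the earlier sketch: the index name, field names, and query text are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SearchDocuments {
    public static void main(String[] args) throws Exception {
        // Full-text query against the extracted "content" field; the "url"
        // field comes back in each hit's _source for use as a download link.
        String query = "{\"query\":{\"match\":{\"content\":\"quarterly report\"}}}";

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/docs/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }

        // Print the raw response; a real application would parse the JSON
        // and render each hit's url value as a link to the original file.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}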
