Take a look at Apache Tika (http://tika.apache.org/). It will let you extract the contents of the documents for indexing; that extraction step is outside the scope of Elasticsearch itself. A good tool for making the files downloadable is also out of scope, but I'll answer what is in scope: you need to put the files somewhere they can be reached by a URL. Any webserver is capable of this; your needs may vary, but this isn't the list for those questions. Once a document is reachable at a URL, include that URL when you index the document so that you can point to it in your search results.
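To make the moving parts concrete, here is a minimal sketch in Java. It uses Tika's parseToString to pull the body text out of a file, then indexes that text together with the file's URL over Elasticsearch's HTTP API. The file path, download URL, index/type names ("docs"/"doc"), and field names (filename, url, content) are all illustrative, not anything your setup requires:

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class IndexDocument {
    public static void main(String[] args) throws Exception {
        // Extract the body text from any supported format (Word, PDF, etc.).
        Tika tika = new Tika();
        File file = new File("/data/docs/report.docx"); // hypothetical path
        String body = tika.parseToString(file);

        // Where the file is served from; any webserver will do.
        String downloadUrl = "http://files.example.com/docs/report.docx";

        // Build the JSON document: extracted text plus the URL to the original.
        String json = String.format(
            "{\"filename\":\"%s\",\"url\":\"%s\",\"content\":%s}",
            file.getName(), downloadUrl, quote(body));

        // Index it over HTTP (index "docs", type "doc", id "1" -- illustrative).
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/docs/doc/1").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Index response: " + conn.getResponseCode());
    }

    // Minimal JSON string escaping for the extracted text; a real JSON
    // library would handle the remaining control characters as well.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                       .replace("\n", "\\n").replace("\r", "\\r")
                       .replace("\t", "\\t") + "\"";
    }
}

In production you would want a proper JSON library and the bulk API rather than one HTTP call per file, but the shape of the indexed document is the point here.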
I'm sure there are other options out there for extracting the contents of Word documents; Apache Tika is simply one that is frequently used for this purpose. A sketch of pulling the URL back out at search time follows the quoted thread below.

On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:

> Okay, so I have a large amount of data (2 TB), all Microsoft Office
> documents, PDFs, and emails. What is the best way to go about indexing
> the body of these documents, i.e. making their contents searchable? I
> tried to use the PHP client, but that isn't helping, and while I know
> there are ways to convert files in PHP, is there nothing available that
> takes in these types of documents? I tried the file_get_contents
> function in PHP, but it only accepts text documents. Also, would you
> know of a good tool or method for making the files that are found
> downloadable?
>
> Thanks,
> Austin
>
> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] wrote:
>>
>> Yes, you need to include all the text you want indexed and searchable
>> as part of the JSON.
>>
>> How else would you expect Elasticsearch to receive the data?
>>
>> Regarding large-scale production environments, this is why
>> Elasticsearch scales out.
>>
>> Aaron
>>
>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon wrote:
>>>
>>> Hello,
>>>
>>> I'm trying to get an understanding of how full-text search works when
>>> the body of the document has to be considered during search. I
>>> understand how to do the mapping and use analyzers, but what I don't
>>> understand is how they get at the body of the document. If your fields
>>> are file name, file size, file path, and file type, how do the
>>> analyzers get the body of the document? Surely you wouldn't have to
>>> put the body of every document into the JSON; that is how it's done in
>>> all the examples I've seen, but it doesn't seem to make sense for
>>> large-scale production environments. If someone could give me some
>>> insight into how this process works, it would be greatly appreciated.
>>>
>>> Thank you,
>>> Austin Harmon
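To close the loop on making files downloadable: because the URL was indexed alongside the extracted content, a full-text query against the content field returns it in each hit's _source, ready to be rendered as a download link in your results page. Same caveats as the earlier sketch: the index name, field names, and query text are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SearchDocuments {
    public static void main(String[] args) throws Exception {
        // Full-text query against the extracted "content" field; the "url"
        // field comes back in each hit's _source for use as a download link.
        String query = "{\"query\":{\"match\":{\"content\":\"quarterly report\"}}}";

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/docs/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }

        // Print the raw response; a real application would parse the JSON
        // and render each hit's url value as a link to the original file.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}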
