There is a plugin called mapper attachments:
https://github.com/elastic/elasticsearch-mapper-attachments

This plugin is supposed to use Tika to index the content of documents, but it
doesn't seem to be working correctly. I base64-encode the documents, but it
comes back as null when I decode it.

On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>
> Not certain what you are referring to, so I expect not. I have used the
> Elasticsearch mappings, but I can't see how those would directly integrate
> with Tika.
>
> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]> wrote:
>
>> Thank you for the information. This is going to be very difficult, I can
>> tell. Do you have experience with the mapper attachment?
>>
>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>>>
>>> You're going to have the same issue with Solr, putting the contents into
>>> XML, which is even heavier than JSON.
>>>
>>> I wish that I had some more experience using Tika, but I do not. I am
>>> aware of its capabilities but have not had reason to use it myself.
>>>
>>> I see what you are saying about others not having the same issue, but
>>> what you must realize is that most users are not indexing that type of
>>> document. They are indexing events, database records, web pages, and so
>>> on. It is a very small subset that indexes things like Word docs and PDFs.
>>>
>>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]> wrote:
>>>
>>>> Thank you for the information. I've been trying to use the mapper
>>>> attachment, which has Apache Tika built into it. I am just surprised and
>>>> confused that so many companies use Elasticsearch and yet it is so
>>>> difficult to index the contents of a document. If I need to index the
>>>> contents of documents, would it be easier and more efficient to switch
>>>> over to Apache Solr? As I said, I have 2 TB of data, so it isn't
>>>> efficient for me to manually input each document so it can be indexed
>>>> with specific JSON. If you have any experience with Solr, please let me
>>>> know if it would be a good solution to my problem.
>>>>
>>>> Thanks,
>>>> Austin
>>>>
>>>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>>>>>
>>>>> Take a look at Apache Tika: http://tika.apache.org/
>>>>>
>>>>> It will allow you to extract the contents of the documents for
>>>>> indexing; this is outside the scope of Elasticsearch indexing. A good
>>>>> tool to make these files downloadable is also out of scope, but I'll
>>>>> answer what is in scope. You need to put the files somewhere that they
>>>>> can be accessed by a URL. Any web server is capable of this. Of course
>>>>> your needs may vary, but this isn't the list for those questions. Once
>>>>> you have a URL that the document can be accessed by, include that in
>>>>> your indexing of the document so that you can point to that URL in
>>>>> your search results.
>>>>>
>>>>> I am sure there are other options out there for extracting the
>>>>> contents of Word documents; Apache Tika is one that is frequently used
>>>>> for this purpose, though.
>>>>>
>>>>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:
>>>>>
>>>>>> Okay, so I have a large amount of data, 2 TB, and it's all Microsoft
>>>>>> Office documents, PDFs, and emails. What is the best way to go about
>>>>>> indexing the body of these documents so that the contents of each
>>>>>> document are searchable? I tried to use the PHP client, but that isn't
>>>>>> helping. I know there are ways to convert files in PHP, but is there
>>>>>> nothing available that takes in these types of documents? I tried the
>>>>>> file_get_contents function in PHP, but it only takes in text
>>>>>> documents. Also, would you know of a good tool or method to make the
>>>>>> files that are searched downloadable?
>>>>>>
>>>>>> Thanks,
>>>>>> Austin
>>>>>>
>>>>>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] wrote:
>>>>>>>
>>>>>>> Yes, you need to include all the text you want indexed and searchable
>>>>>>> as part of the JSON.
>>>>>>>
>>>>>>> How else would you expect Elasticsearch to receive the data?
>>>>>>>
>>>>>>> Regarding large-scale production environments, this is why
>>>>>>> Elasticsearch scales out.
>>>>>>>
>>>>>>> Aaron
>>>>>>>
>>>>>>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm trying to get an understanding of how to have full-text search
>>>>>>>> on a document and have the body of the document be considered
>>>>>>>> during search. I understand how to do the mapping and use analyzers,
>>>>>>>> but what I don't understand is how they get the body of the
>>>>>>>> document. If your fields are file name, file size, file path, and
>>>>>>>> file type, how do the analyzers get the body of the document? Surely
>>>>>>>> you wouldn't have to put the body of every document into the JSON.
>>>>>>>> That is how I've seen it done in all the examples I've seen, but
>>>>>>>> that doesn't make sense for large-scale production environments. If
>>>>>>>> someone could please give me some insight as to how this process
>>>>>>>> works, it would be greatly appreciated.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Austin Harmon
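The approach discussed in this thread (send the file's bytes to Elasticsearch as base64 inside the JSON, and include a URL so search results can link back to the original) can be sketched roughly as follows. This is only an illustration, written in Python rather than the PHP client mentioned above. The helper `build_document`, the field names `file`, `file_name`, `file_size`, and `url`, and the example URL are all assumptions made for the sketch, not names taken from the mapper attachments plugin. The round-trip check is included because a base64 value that does not decode back to the original bytes is worth ruling out before suspecting the plugin.

```python
import base64
import json


def build_document(name: str, raw: bytes, url: str) -> str:
    """Return a JSON body carrying the file as base64 plus its metadata.

    All field names below are hypothetical; adjust them to your own mapping.
    """
    encoded = base64.b64encode(raw).decode("ascii")

    # Round-trip check: decoding must reproduce the original bytes exactly.
    if base64.b64decode(encoded) != raw:
        raise ValueError("base64 round trip failed for " + name)

    doc = {
        "file": encoded,        # assumed name of the attachment field
        "file_name": name,
        "file_size": len(raw),  # size of the original file in bytes
        "url": url,             # where the original can be downloaded
    }
    return json.dumps(doc)


if __name__ == "__main__":
    # Sample bytes stand in for a real file read with open(path, "rb").
    body = build_document(
        "report.pdf",
        b"%PDF-1.4 sample bytes",
        "http://files.example.com/report.pdf",
    )
    print(body)
```

Reading files in binary mode and encoding the raw bytes keeps Office documents and PDFs intact; whether the plugin then extracts text from the encoded field depends on the index mapping, which this sketch does not cover.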
-- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0b4f70b8-bcd7-4c66-ad72-b0a478332e36%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
