He posted limited details in a separate thread, "mapper-attachment and base64 encoding".
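Since the symptom reported there is that the base64 data "comes back as null" on decode, a quick local round-trip check can help separate an encoding problem from a submission problem. The following is only a minimal sketch using the Python standard library. The field name `file`, the helper name `build_attachment_body`, and the sample bytes are hypothetical, and nothing here talks to Elasticsearch or to the plugin itself.

```python
import base64
import json


def build_attachment_body(raw: bytes, field: str = "file") -> str:
    """Return a JSON body whose `field` holds the base64 text of `raw`.

    `field` is a hypothetical name; use whatever field your mapping
    declares for the attachment.
    """
    # b64encode returns bytes; JSON needs text, so decode to ASCII.
    encoded = base64.b64encode(raw).decode("ascii")
    # Round-trip check: decoding must give back the exact original bytes.
    if base64.b64decode(encoded, validate=True) != raw:
        raise ValueError("base64 round trip failed")
    return json.dumps({field: encoded})


# Hypothetical sample; in practice read the file in binary mode ("rb").
sample = b"%PDF-1.4 hypothetical sample bytes"
body = build_attachment_body(sample)

# Decode what would be sent and compare it with the input.
decoded = base64.b64decode(json.loads(body)["file"])
```

If `decoded` matches the original bytes here, the encoding step is probably fine and the problem is more likely in how the request is built or submitted.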
I was not asserting that it does not work, just that it may not be the best way to handle a "large number of documents". I suspect there is an issue with encoding or submitting the document.

On Fri, Mar 13, 2015 at 1:35 PM, David Pilato <[email protected]> wrote:

> I’m a bit concerned about your "it does not work" statement.
> We have only 4 open issues on it today:
> https://github.com/elastic/elasticsearch-mapper-attachments/issues
> 1 bug and 3 feature requests.
>
> Could you explain a bit more what is not working? Maybe I missed
> something.
>
> --
> *David Pilato* - Developer | Evangelist
> *Elasticsearch.com <http://Elasticsearch.com>*
> @dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr
> <https://twitter.com/elasticsearchfr> | @scrutmydocs
> <https://twitter.com/scrutmydocs>
>
> On 13 March 2015 at 10:49, Austin Harmon <[email protected]> wrote:
>
> There is a plugin called mapper attachments:
> https://github.com/elastic/elasticsearch-mapper-attachments This plugin
> is supposed to use Tika to index the content of documents but it doesn't
> seem to be working correctly. I base64 encode the documents but it comes
> back as null when I decode it.
> On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>>
>> Not certain what you are referring to, so I expect not. I have used the
>> elasticsearch mappings, but I can't see how those would directly integrate
>> with Tika.
>>
>> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]>
>> wrote:
>>
>>> Thank you for the information. This is going to be very difficult, I can
>>> tell. Do you have experience with the mapper attachment?
>>>
>>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>>>>
>>>> You're going to have the same issue with SOLR, putting the contents into
>>>> XML, which is even heavier than JSON.
>>>>
>>>> I wish that I had some more experience using Tika, but I do not. I am
>>>> aware of its capabilities but have not had reason to use it myself.
>>>>
>>>> I see what you are saying about others not having the same issue, but
>>>> what you must realize is that most users are not indexing that type of
>>>> document. They are indexing events, database records, web pages and so
>>>> on. It is a very small subset that index things like Word docs and PDFs.
>>>>
>>>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you for the information. I've been trying to use the mapper
>>>>> attachment, which has Apache Tika built into it. I am just surprised and
>>>>> confused that so many companies use elasticsearch and yet it is so
>>>>> difficult to index the contents of a document. If I need to index the
>>>>> contents of documents, would it be easier and more efficient to switch
>>>>> over to Apache Solr? As I said, I have 2TB of data, so it isn't efficient
>>>>> for me to manually input each document so it can be indexed with
>>>>> specific JSON.
>>>>> If you have any experience with Solr, please let me know if it would be a
>>>>> good solution to my problem.
>>>>>
>>>>> thanks,
>>>>> Austin
>>>>>
>>>>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>>>>>>
>>>>>> Take a look at Apache Tika http://tika.apache.org/.
>>>>>> It will allow you to extract the contents of the documents for indexing;
>>>>>> this is outside the scope of the ElasticSearch indexing. A good tool to
>>>>>> make these files downloadable is also out of scope, but I'll answer
>>>>>> what is in scope. You need to put the files somewhere that they can be
>>>>>> accessed by a URL. Any webserver is capable of this. Of course your
>>>>>> needs may vary, but this isn't the list for those questions.
>>>>>> Once you have a URL that the document can be accessed by, include that
>>>>>> in your indexing of the document so that you can point to that URL in
>>>>>> your search results.
>>>>>>
>>>>>> I am sure there are other options out there for extracting the
>>>>>> contents of Word documents; Apache Tika is one that is frequently used
>>>>>> for this purpose though.
>>>>>>
>>>>>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Okay, so I have a large amount of data (2 TB), and it's all Microsoft
>>>>>>> Office documents, PDFs and emails. What is the best way to go about
>>>>>>> indexing the body of these documents so that the contents of the
>>>>>>> documents are searchable? I tried to use the PHP client but that isn't
>>>>>>> helping. I know there are ways to convert files in PHP, but is there
>>>>>>> nothing available that takes in these types of documents? I tried the
>>>>>>> file_get_contents function in PHP but it only takes in text documents.
>>>>>>> Also, would you know of a good tool or a method to make the files that
>>>>>>> are searched downloadable?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Austin
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected]
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Yes, you need to include all the text you want indexed and
>>>>>>>> searchable as part of the JSON.
>>>>>>>>
>>>>>>>> How else would you expect ElasticSearch to receive the data?
>>>>>>>>
>>>>>>>> Regarding large scale production environments, this is why
>>>>>>>> ElasticSearch scales out.
>>>>>>>>
>>>>>>>> Aaron
>>>>>>>>
>>>>>>>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm trying to get an understanding of how to have full text
>>>>>>>>> search on the document and have the body of the document be considered
>>>>>>>>> during search.
>>>>>>>>> I understand how to do the mapping and use analyzers, but
>>>>>>>>> what I don't understand is how they get the body of the document. If
>>>>>>>>> your fields are file name, file size, file path and file type, how do
>>>>>>>>> the analyzers get the body of the document? Surely you wouldn't have to
>>>>>>>>> put the body of every document into the JSON. That is how I've seen it
>>>>>>>>> done in all the examples I've seen, but that doesn't make sense for
>>>>>>>>> large scale production environments. If someone could please give me
>>>>>>>>> some insight as to how this process works it would be greatly
>>>>>>>>> appreciated.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Austin Harmon
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to a topic in
>>>>>>> the Google Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this topic, visit
>>>>>>> https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe.
>>>>>>> To unsubscribe from this group and all its topics, send an email to
>>>>>>> [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com.
>>>>>>>
>>>>>>> For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAF9vEEqxZ9XPD8aB0jg3xartJEUW6NAKQGe%3D7Z8_udHsUQDijA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
