Thanks. I missed the post. Will answer there.

--
David Pilato - Developer | Evangelist
Elasticsearch.com
@dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr <https://twitter.com/elasticsearchfr> | @scrutmydocs <https://twitter.com/scrutmydocs>

> On 13 March 2015 at 12:41, Aaron Mefford <[email protected]> wrote:
>
> He posted limited details in a separate thread:
>
> "mapper-attachment and base64 encoding"
>
> I was not asserting that it does not work, just that it may not be the best
> way to handle a "large number of documents".
>
> I suspect there is an issue with encoding or submitting the document.
>
> On Fri, Mar 13, 2015 at 1:35 PM, David Pilato <[email protected]> wrote:
> I'm a bit concerned about your "it does not work" statement.
> We have only 4 open issues on it today:
> https://github.com/elastic/elasticsearch-mapper-attachments/issues
> 1 bug and 3 feature requests.
>
> Could you explain a bit more what is not working? Maybe I missed something.
>
>> On 13 March 2015 at 10:49, Austin Harmon <[email protected]> wrote:
>>
>> There is a plugin called mapper attachments:
>> https://github.com/elastic/elasticsearch-mapper-attachments
>> This plugin is supposed to use Tika to index the content of documents, but
>> it doesn't seem to be working correctly. I base64 encode the documents, but
>> it comes back as null when I decode it.
>>
>> On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>> Not certain what you are referring to, so I expect not. I have used the
>> elasticsearch mappings, but I can't see how those would directly integrate
>> with Tika.
>>
>> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]> wrote:
>> Thank you for the information. This is going to be very difficult, I can tell.
>> Do you have experience with the mapper attachment?
>>
>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>> You're going to have the same issue with Solr, putting the contents into XML,
>> which is even heavier than JSON.
>>
>> I wish that I had some more experience using Tika; I do not. I am aware of
>> its capabilities but have not had reason to use it myself.
>>
>> I see what you are saying about others not having the same issue, but what
>> you must realize is that most users are not indexing that type of document.
>> They are indexing events, database records, web pages and so on. It is a
>> very small subset that index things like Word docs and PDFs.
>>
>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]> wrote:
>> Thank you for the information. I've been trying to use the mapper attachment,
>> which has Apache Tika built into it. I am just surprised and confused that
>> so many companies use elasticsearch and yet it is so difficult to index the
>> contents of a document. If I need to index the contents of documents, then
>> would it be easier and more efficient to switch over to Apache Solr? As I
>> said, I have 2 TB of data, so it isn't efficient for me to manually input each
>> document so it can be indexed with specific JSON. If you have any experience
>> with Solr, please let me know if it would be a good solution to my problem.
>>
>> Thanks,
>> Austin
>>
>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>> Take a look at Apache Tika: http://tika.apache.org/
>> It will allow you to extract the contents of the documents for indexing;
>> this is outside of the scope of the ElasticSearch indexing. A good tool to
>> make these files downloadable is also out of scope, but I'll answer to what
>> is in scope. You need to put the files somewhere that they can be accessed
>> by a URL.
>> Any webserver is capable of this; of course your needs may vary,
>> but this isn't the list for those questions. Once you have a URL that the
>> document can be accessed by, include that in your indexing of the document
>> so that you can point to that URL in your search results.
>>
>> I am sure there are other options out there for extracting the contents of
>> Word documents; Apache Tika is one that is frequently used for this purpose,
>> though.
>>
>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:
>> Okay, so I have a large amount of data, 2 TB, and it's all Microsoft Office
>> documents, PDFs and emails. What is the best way to go about indexing the
>> body of these documents, so that their contents are searchable? I
>> tried to use the PHP client but that isn't helping, and I know there are ways
>> to convert files in PHP, but is there nothing available that takes in these
>> types of documents? I tried the file_get_contents function in PHP but it
>> only takes in text documents. Also, would you know of a good tool or a method
>> to make the files that are searched downloadable?
>>
>> Thanks,
>> Austin
>>
>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] wrote:
>> Yes, you need to include all the text you want indexed and searchable as part
>> of the JSON.
>>
>> How else would you expect ElasticSearch to receive the data?
>>
>> Regarding large scale production environments, this is why ElasticSearch
>> scales out.
>>
>> Aaron
>>
>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon wrote:
>> Hello,
>>
>> I'm trying to get an understanding of how to have full text search on the
>> document and have the body of the document be considered during search. I
>> understand how to do the mapping and use analyzers, but what I don't
>> understand is how they get the body of the document.
>> If your fields are file name, file size, file path and file type, how do the
>> analyzers get the body of the document? Surely you wouldn't have to put the
>> body of every document into the JSON? That is how it is done in all the
>> examples I've seen, but that doesn't make sense for large scale production
>> environments. If someone could please give me some insight as to how this
>> process works, it would be greatly appreciated.
>>
>> Thank you,
>> Austin Harmon
>>
>> --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
>> [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5F337472-7F1B-462F-A9A2-A617D6F4536A%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.
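
For anyone hitting the "comes back as null when I decode it" symptom described in the thread, it may help to check the base64 round trip locally before submitting anything. The following is only a minimal Python sketch under stated assumptions: the field names "file" and "url", the helper names, and the example URL are hypothetical and are not taken from the plugin or from anyone's actual setup. It shows a file's bytes being base64 encoded into a JSON body, alongside the download URL suggested in the thread, and then decoded again to confirm nothing was lost.

```python
import base64
import json


def build_index_body(raw_bytes, download_url):
    """Return a JSON string carrying the file as base64 plus its download URL."""
    # b64encode returns bytes; JSON needs text, so decode to an ASCII string.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    # "file" and "url" are hypothetical field names chosen for this sketch.
    return json.dumps({"file": encoded, "url": download_url})


def decode_file_field(body):
    """Decode the base64 field back to bytes to check the round trip."""
    doc = json.loads(body)
    return base64.b64decode(doc["file"])


if __name__ == "__main__":
    sample = b"%PDF-1.4 example bytes"
    # The URL below is a placeholder, not a real location.
    body = build_index_body(sample, "http://files.example.com/report.pdf")
    # If this fails, the problem is in the encoding step, not in the index.
    assert decode_file_field(body) == sample
    print(len(body), "bytes of JSON ready to submit")
```

If the round trip succeeds locally, the encoding itself is probably fine and the issue is more likely in how the document is submitted or mapped.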
