Wow, thanks for your quick reply, Jörg! :-)

> Tagging does not consume that much space - remember, you have an inverted
> index, the frequency of words does not correlate with index growth.

Thank you for pointing that out, I hadn't thought of it!

> It is the easiest method to classify documents, see folksonomies.

I will certainly look into folksonomies, because I am not that familiar with
classification strategies.

> Be careful of synonym files, they are inefficient, slow, and comes with an
> extra price - you have to restart the cluster each time you modify the
> synonyms, and if you use synonyms in the index, you have to reindex. Maybe
> you do not want that overhead.

True. But in this specific case, the synonyms would not change once properly
set, since (the way I think of it) it is just a matter of reducing a few
terms, if present, to either "rent" or "buy" (from my example). I realized
that this is a "tagging" strategy as well (sort of), except that it is a
one-term tag each time.

> Stemming can not help either in document classification.

I thought of stemming because of cases like "rent a car in London" or "car
rental in London". The keywords here are "rent" and "rental". If I understood
your tagging suggestion correctly, I would then tag each and every
"rent"-typed document with both "rent" and "rental", am I right? (The example
is trivial in English, but becomes a bit more complicated in French, for
instance, where verb conjugation produces more forms that are "correct" terms
even though they are neither relevant nor necessary in the search context.
Quite a few people would, for example, type "loué", the past participle of
"louer", because they are pronounced the same way and both are correctly
spelled.)

With keyword analysis, there is no stemming, right? So the tough part would
be tagging documents with all the possible terms and forms, so that ES could
finally identify a document as either "rent" or "buy", am I still right?
This seems pretty complicated to me, actually, much more so than processing
the query with a language-specific analyzer (which would use stemming,
right?) and comparing the result to the "type" field. But again, I may be
totally wrong, because I am really new to ES and everything it brings. I have
read a lot about it (the docs, etc.), and quite a few elements tend to get
mixed up in my mind...

> If you want to process natural language queries and examine the sentence
> for the meaning and express the meaning in useful tags, you can try plugins
> for POS tagging, e.g.
> https://github.com/richardwilly98/elasticsearch-auto-tagging

Oh, thank you, I will have a look at it. It will probably help me properly
understand the pros of this approach!

> There are plenty of approaches in the natural language processing field,
> most of them work in front of ES, not as plugins.

That is really interesting. As I said, I would definitely prefer a solution
that lets me know, from the query, which type to search, instead of querying
all of them and then filtering. For instance, imagine that I have two index
types, "buy" with 10000 documents and "rent" with 1000000. Querying 1010000
documents in total would not be a real issue if the query was expected to
return "rent"-typed documents. But if the expected documents were in "buy",
querying that many documents for an original pool of just 10000 would be a
real overhead (well, so I think, but I may be wrong again...). Plus,
filtering would still be "redundant" in some way, as the documents are
already properly stored in either "buy" or "rent": it would seem much better
to me to use only the relevant type right away in any case...

What I thought of a few minutes after posting my original question was a
two-pass process:

1. Analyze the query (for instance:
   /test_index/_analyze?analyzer=my_custom_analyzer&text=the+text+of+the+query).
   This would return the parsed result, which could contain either "rent"
   (from terms like "rent", "rental", "hire", "hiring", ...) or "buy" (from
   terms like "buy", "purchase", etc.), or neither of them.
2. Then run the real _search query on the proper index type, or on both if
   neither "rent" nor "buy" was found in the first pass.

Does that process make sense? It is the only thing I can think of right now
that would avoid querying several index types and then filtering the matches,
while also avoiding an external process in front of my ES index...

Many thanks for your help! :-)

Cheers
JM

On Wednesday, March 4, 2015 at 16:25:37 UTC+1, Jörg Prante wrote:
>
> Tagging does not consume that much space - remember, you have an inverted
> index, the frequency of words does not correlate with index growth.
>
> It is the easiest method to classify documents, see folksonomies.
>
> Be careful of synonym files, they are inefficient, slow, and comes with an
> extra price - you have to restart the cluster each time you modify the
> synonyms, and if you use synonyms in the index, you have to reindex. Maybe
> you do not want that overhead.
>
> Stemming can not help either in document classification.
>
> If you want to process natural language queries and examine the sentence
> for the meaning and express the meaning in useful tags, you can try plugins
> for POS tagging, e.g.
> https://github.com/richardwilly98/elasticsearch-auto-tagging
>
> There are plenty of approaches in the natural language processing field,
> most of them work in front of ES, not as plugins.
>
> Jörg
>
>
> On Wed, Mar 4, 2015 at 4:02 PM, Jean-Marc F. <[email protected]> wrote:
>
>> Thank you Jörg ! :-)
>>
>> I did think of the tag approach: it is very close to the first scheme I
>> described in my question, that is: querying over the two types then
>> filtering.
>> It still seems to me that this is an overhead that can be avoided?
>> (Not critical with a few documents, but it might become so when both
>> types' sizes increase...)
>>
>> I discarded the tag approach for another reason too: the need to tag each
>> "rent" or "buy" document with always the same words/expressions, which
>> would inflate the data size and would not leverage ES' intrinsic full-text
>> abilities (such as stemming, synonym handling, etc.). I do think that, in
>> that context, working on a simple field/tag ("type" or even "_type" if
>> feasible?) with the proper analyzer and synonym file would be more
>> efficient and less error prone.
>>
>> But thank you again anyway for your feedback on this topic. It makes me
>> feel more confident, as I did envisage this approach - letting me think I
>> am not totally lost ^^
>>
>> Cheers,
>> JM
>>
>> On Wednesday, March 4, 2015 at 12:19:17 UTC+1, Jörg Prante wrote:
>>>
>>> My suggestion is, instead of selecting a unique type, you should tag
>>> documents in the index with a given vocabulary, and at query time, you
>>> could match certain phrases in the query text with that vocabulary in
>>> order to build a filter clause.
>>>
>>> Jörg
>>>
>>>
>>> On Wed, Mar 4, 2015 at 11:10 AM, Jean-Marc F. <[email protected]> wrote:
>>>
>>>> Now that I have written my question: would it be a two-pass job? First
>>>> pass: send an "analyze" query to get the proper term "rent" or "buy"
>>>> (or both if none), then second pass => query the proper type?
>>>>
>>>>
>>>> On Wednesday, March 4, 2015 at 11:07:43 UTC+1, Jean-Marc F. wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I am pretty new to ES and need some advice for the following use case:
>>>>> I have a single input field for user search (Google-like). In my test
>>>>> index, I have two different types, let's call them "rent" and "buy".
>>>>> What I would like to achieve is to leverage ES's powerful full-text
>>>>> features to determine which index type to query depending on (part
>>>>> of) the query.
>>>>>
>>>>> For instance, for a query such as "rent a motorcycle in Paris" or
>>>>> "hire a flat in Rome" => is there a way to have ES "know" it should
>>>>> look into the "rent" type?
>>>>>
>>>>> I thought of a first possibility: query both types
>>>>> (/rent,buy/_search), then filter on a (quite redundant) "type" field
>>>>> created each time a document is indexed, with the proper
>>>>> analyzers/synonyms applied to this "type" field to always simplify
>>>>> things to "rent" or "buy". (Or more directly the "_type" field, but I
>>>>> don't think you can apply analysis to it, can you?)
>>>>>
>>>>> The "con" of this approach is that I have to query both the rent and
>>>>> the buy types, then filter to narrow the results to the expected type
>>>>> of documents. The "pro" is that it should not be complicated to get it
>>>>> working properly.
>>>>>
>>>>> Now, I am wondering if it would be possible to have ES "figure out"
>>>>> which index to query right after analysis? In a process like: query =>
>>>>> analysis => "rent" or "buy" term identified => search on the right
>>>>> index type.
>>>>> The pros would be that you obviously query one index type and thus
>>>>> don't need to filter afterwards: smaller data set + no filtering,
>>>>> should be lighter/faster.
>>>>> The cons: I do not think that ES can do it.
>>>>>
>>>>> Another scenario would be to handle a first, app-specific analysis
>>>>> step before querying ES, just to determine "rent" or "buy". With this
>>>>> example it would not be that tough (two types, a few synonyms/a bit of
>>>>> stemming to take into account, etc.), but with a more complex setup it
>>>>> would become a real nightmare - not to mention the fact that not using
>>>>> ES's abilities would be quite a pity, actually...
>>>>>
>>>>> I would really appreciate your thoughts on this, you all :-)
>>>>>
>>>>> Thanks
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/f5de1e5b-c2e6-4cd2-9019-8e520979b6a2%40googlegroups.com
>>>> For more options, visit https://groups.google.com/d/optout.
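
For illustration only, the two-pass process proposed at the top of this message could be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not a tested ES integration: `fake_analyze` is a hypothetical pure-Python stand-in for the `_analyze` call with `my_custom_analyzer`, the `CANONICAL` table is a made-up vocabulary built from the terms mentioned in the thread, and the search path shape is assumed from the `/rent,buy/_search` form used in the original question.

```python
# Hypothetical vocabulary: in the thread, this reduction would be done on the
# ES side by my_custom_analyzer (synonyms/stemming), not in application code.
CANONICAL = {
    "rent": "rent", "rental": "rent", "hire": "rent", "hiring": "rent",
    "buy": "buy", "purchase": "buy",
}
ALL_TYPES = ("rent", "buy")


def fake_analyze(text):
    """Stand-in for pass 1 (the _analyze request with my_custom_analyzer).

    Lowercases, splits on whitespace, and reduces known variants to
    "rent" or "buy"; all other tokens are returned unchanged.
    """
    return [CANONICAL.get(token, token) for token in text.lower().split()]


def pick_types(tokens):
    """Return the types named in the tokens, or every type if none is found."""
    found = [t for t in ALL_TYPES if t in tokens]
    return found or list(ALL_TYPES)


def search_path(index, query_text, analyze=fake_analyze):
    """Build the path for pass 2, e.g. /test_index/rent/_search."""
    types = pick_types(analyze(query_text))
    return "/{}/{}/_search".format(index, ",".join(types))


if __name__ == "__main__":
    for q in ("rent a motorcycle in Paris", "purchase a flat in Rome",
              "a flat in Rome"):
        print(q, "->", search_path("test_index", q))
```

In a real setup, the `analyze` argument would be replaced by a function that sends the query text to the cluster and returns the tokens from the response; the fallback to both types when neither term is found mirrors step 2 of the process described above.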
