Ben, this is a difficult use case for the Lucene index on which ES is 
built, because in essence you have two primary objects of interest, and a 
relationship between them.  

The parent/child relation is useful if you have a (document) tree, so I 
think you could use it to relate products and UPCs. But in this instance you 
also have a many-to-many relation from retailer to product (and UPC), and 
parent/child won't help you there.

We have a similar problem in my company with access control to documents.  
Our customers acquire access to documents: we have a lot of documents and a 
lot of customers (although many fewer than documents), and we want 
customers to be able to search only for products they have access to.  
We're only filtering on a single customer, where you want to use a set, but 
the principle is similar.

The usual alternative to denormalizing, and what we have been doing, is to 
build queries using all the product ids the customer has access to. But 
this only scales up to about 1000 products per customer, since beyond that 
the queries get too many terms and slow down.  The only reason we 
were able to do this up until now is that many of our sites have a small 
number of huge products that have lots of documents in them.  But now we 
are getting sites with a lot of medium-sized products, and we are having 
queries blow up with too many terms.
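
A rough sketch of what such a query looks like may help show where the terms 
come from. It assumes the ES 1.x "filtered" query syntax and a hypothetical 
`product_id` field on each document; the function name and the keyword string 
are made up for illustration and are not taken from any real system.

```python
import json


def access_query(keywords, product_ids):
    """Build an ES 1.x-style search body restricted to the given products.

    `product_id` is a hypothetical field name; every product the customer
    can access becomes one term in the filter.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match": {"_all": keywords}},
                "filter": {"terms": {"product_id": list(product_ids)}},
            }
        }
    }


# A customer with access to 1000 products needs 1000 filter terms.
body = access_query("index scaling", range(1, 1001))
term_count = len(body["query"]["filtered"]["filter"]["terms"]["product_id"])
print(term_count, "terms,", len(json.dumps(body)), "bytes of request body")
```

The request body grows linearly with the number of products, which is the 
growth that eventually makes this approach impractical.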

Then we thought, well, we'll denormalize by indexing the customer access 
relation in with the document: basically tag every product document with 
the customers that have it. This works great for search, since you only need 
a single query term per customer, but it places a huge burden on the indexer, 
which becomes complex and has to do a lot more updates.  Especially if you 
are going to index every hour, there will presumably be a lot of change, 
although possibly incremental? We haven't actually tried this in 
production, but I have been doing some calculations and I think the 
indexing cost will be prohibitive.  The answer is hard to be definite about 
because it is highly dependent on the distribution of the data.  But this 
is the most natural answer for a search index like Lucene/ES.
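
To make the trade-off concrete, here is a minimal sketch under assumed names: 
a hypothetical `customer_ids` field on each document, a hypothetical `text` 
field for the keywords, and ES 1.x "filtered" query syntax. The document 
counts are invented purely to show the arithmetic.

```python
def tag_document(doc, customer_ids):
    """Return a copy of `doc` carrying the hypothetical `customer_ids` field."""
    tagged = dict(doc)
    tagged["customer_ids"] = sorted(customer_ids)
    return tagged


def customer_query(keywords, customer_id):
    """Search body that needs only one filter term per customer."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"text": keywords}},
                "filter": {"term": {"customer_ids": customer_id}},
            }
        }
    }


def reindex_count(docs_per_product, changed_products):
    """Documents that must be rewritten when access to these products changes.

    Every document in an affected product carries the customer list, so each
    one has to be reindexed in full.
    """
    return sum(docs_per_product[p] for p in changed_products)


# Invented sizes: one huge product and two smaller ones.
docs_per_product = {"p1": 50000, "p2": 300, "p3": 1200}

doc = tag_document({"product_id": "p2", "text": "some text"}, {42, 7})
query = customer_query("some text", 42)
print(reindex_count(docs_per_product, ["p1", "p3"]))
```

The query side stays constant in size, while a single access change to a 
large product touches every document in it.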

Currently I'm experimenting with generating product groups automatically.  
I believe our customers tend to buy the same groups of products, and if 
that's true, we can index those groups in the products and record the 
relation of customer->group. But this is kind of complicated and not yet 
working in a form I can share, sorry.
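
For illustration only: one simple way to derive such groups is to put 
products held by exactly the same set of customers into one group. This is 
an assumed construction, not the experimental method described above, and 
all names and data in it are hypothetical.

```python
from collections import defaultdict


def build_groups(access):
    """Group products that are held by exactly the same set of customers.

    `access` maps customer -> set of product ids. Returns
    (product_to_group, customer_to_groups).
    """
    # Invert the relation: product -> set of customers holding it.
    holders = defaultdict(set)
    for customer, products in access.items():
        for product in products:
            holders[product].add(customer)

    group_ids = {}          # customer-set signature -> group id
    product_to_group = {}
    customer_to_groups = defaultdict(set)
    for product in sorted(holders):
        signature = frozenset(holders[product])
        gid = group_ids.setdefault(signature, len(group_ids))
        product_to_group[product] = gid
        for customer in signature:
            customer_to_groups[customer].add(gid)
    return product_to_group, dict(customer_to_groups)


access = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {3, 4}}
product_to_group, customer_to_groups = build_groups(access)
print(product_to_group)
print(customer_to_groups)
```

Each product document would then carry a single group id, and a customer's 
query would filter on that customer's group ids rather than on every product 
id. The saving depends entirely on how much the customers' purchases overlap.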

If you get your queries working OK, I think you will be able to retrieve 
customer ids, but the devil is in the details, of course.

-Mike

On Friday, February 14, 2014 10:01:22 PM UTC-5, Ben Hirsch wrote:
>
> We are considering using Elasticsearch for an upcoming project. I have 
> done quite a bit of research on the API and am curious about a few things 
> as they relate to a specific but very necessary use-case for the project. 
> If we cannot satisfy this use-case I do not think ES will be right for us.
>
> My research has led me to using a *parent/child relationship* and the 
> *has_child* query (I also looked at the 'nested' and 'inner object'). But 
> I am not sure if this is the best approach as I am still wrapping my head 
> around how to best strategize for Elasticsearch and denormalize my data. 
> We currently have a relational database in place and are planning on 
> setting up Elasticsearch to run alongside this DB as our search 
> repository.
>
> The use-case is as follows:
>
> * We are storing product information (300,000+ products) as type 'product'.
> * We are also storing inventory data for 20,000+ retailers.
> * Each product has a set of UPCs and each retailer has a list of UPCs they 
> carry along with the quantities in stock.
> * The 'product_retailers' type stores ALL of the retailers who carry the 
> parent product. This will be re-indexed very often (at least once an hour 
> for each product)
> * The document model I am proposing we use:
>
> $ curl -XPUT 'http://localhost:9200/products/product/1' -d '
> {
>   "name" : "Foo",
>   "description" : "Bar...",
>   ...
> }'
>
> $ curl -XPUT 'http://localhost:9200/products/product_retailers/_mapping' \
>   -d '
> {
>   "product_retailers":{
>     "_parent":{
>       "type" : "product"
>     }
>   }
> }'
>
> $ curl -XPUT 'http://localhost:9200/products/product_retailers/?parent=1' 
> -d '
> {
> {
> "id" : 888, // the retailer id
> "upcs" : [
>
>      {
>      "code" : 123456789012,
>      "quantity" : 22
>      },
>      {
>      "code" : 123456789013,
>      "quantity" : 19
>      },
>      {
>      "code" : 123456789014,
>      "quantity" : 27
>      },
>
>      ...
>     ]
> },
> {
> "id" : 889, // the retailer id
> "upcs" : [
>
>      {
>      "code" : 123456789012,
>      "quantity" : 11
>      },
>      {
>      "code" : 123456789013,
>      "quantity" : 2
>      },
>      {
>      "code" : 123456789014,
>      "quantity" : 1
>      },
>
>      ...
>     ]
> }
> }'
>
>
>
> * We need to be able to filter product results (based on keyword matches) 
> against a set of retailer IDs that have the product in stock.
> * Put another way, given a list of retailer IDs and a search 
> keyphrase, we need to be able to return matching products. 
> * A huge bonus would be to ALSO include the data about the matching 
> retailers in the result set. 
>
> Is this even possible with ES? Am I modeling my data correctly 
> so that it can scale well to the quantity of items we are storing?
>
>
>
