Re: index design for web activity

Chen Wang Wed, 14 Jan 2015 10:07:13 -0800

Adrien,
Is there a clearer version of the video recording? I can barely see the 
slides, and I don't quite get the idea of entity-centric indexing.
Does it mean, for my use case, maintaining a single user document that 
contains lists of activities, and at index time simply updating those 
lists?
Something like:
{
  "_source": {
    "customer_id": 123,
    "browse": [{"item": "item1", "time": "time1"}, {"item": "item2", "time": "time2"}],
    "purchase": [{"item": "item1", "time": "time1"}, {"item": "item2", "time": "time2"}]
  }
}

At index time, I would just update the browse/purchase lists?
Then my query basically becomes flat.
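
A minimal sketch of what I mean, in plain Python with no Elasticsearch client involved. The field names `item` and `time`, the helper names, and the in-memory dict standing in for the index are all assumptions for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the index: customer_id -> user document.
users = {}

def record_activity(customer_id, kind, item, when):
    """Append one event to the customer's 'browse' or 'purchase' list."""
    doc = users.setdefault(
        customer_id, {"customer_id": customer_id, "browse": [], "purchase": []}
    )
    doc[kind].append({"item": item, "time": when})

def did(doc, kind, item, since):
    """True if the document has a `kind` event for `item` at or after `since`."""
    return any(e["item"] == item and e["time"] >= since for e in doc[kind])

def find_customers(browsed, browse_since, purchased, purchase_since):
    """Flat query: each document is checked on its own, with no join."""
    return sorted(
        cid for cid, doc in users.items()
        if did(doc, "browse", browsed, browse_since)
        and did(doc, "purchase", purchased, purchase_since)
    )

now = datetime(2015, 1, 14)
record_activity(123, "browse", "A", now - timedelta(days=30))
record_activity(123, "purchase", "B", now - timedelta(days=10))
record_activity(456, "browse", "A", now - timedelta(days=400))
record_activity(456, "purchase", "B", now - timedelta(days=5))

# Browsed A in the last 90 days and purchased B in the last 60 days.
matches = find_customers("A", now - timedelta(days=90), "B", now - timedelta(days=60))
```

Here only customer 123 matches, because customer 456 browsed A too long ago. In Elasticsearch itself, as far as I know, matching the item and the time within the same list entry would require mapping these lists as nested objects, since plain object arrays are flattened.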

Is my understanding correct?
Chen


On Sunday, December 21, 2014 at 1:54:48 PM UTC-8, Adrien Grand wrote:
>
>
>
> On Sat, Dec 20, 2014 at 12:53 AM, Chen Wang <[email protected]> wrote:
>
>> Hey guys,
>> I'd like your suggestions on index design for web activity data.
>> Let's say I have browse data, online purchase data, and store purchase 
>> data, and I need to keep a year of each.
>> A year of browse data is around 80G, online purchase data is around 50G, 
>> and offline (store) data is around 1T.
>>
>> I have to run queries like: find all the customers who browsed item A 
>> in the past X months and also purchased item B online in the past Y months. 
>> Originally I used a complicated parent/child structure, which sometimes 
>> results in very bad performance. I store all browse, online purchase, and 
>> store purchase data in one index distributed across 7 shards.
>>
>
> Parent/child is indeed slow. Can you somehow denormalize your data to make 
> queries faster?
>  
>
>> I have 7 machines with 128G each, and 1T hard disk.
>>
>> Now I am trying to save each of those types of data into its own index, 
>> say browse_v1, onlinepurchase_v1, storepurchase_v1. Since it's time-based 
>> data, how should I decide whether to break the indexes up monthly or simply 
>> yearly? For browse (70G) and online purchase (50G), I think I can just use 
>> one index with one shard each. Or should I break them into monthly indexes 
>> instead? Breaking them into monthly indexes gives me the flexibility of 
>> adding/removing data, but it will also decrease query performance, right? 
>> (A search against 1 index now becomes a search against 12 indexes.)
>>
>> For store data (1T), apparently I have to break it into at least monthly 
>> indexes, but each monthly index would still contain around 100G of data. 
>> With my current cluster, how many shards should I allocate to each monthly 
>> index? I am also concerned about query performance.
>>
>> Since I am now storing them in separate indexes, to achieve the 
>> query I want I will need to do an application-level join. Is this the 
>> common way to handle such a use case?
>>
>
> As much as possible, you should try to design your documents in such a way 
> that you don't need to perform joins at search time. Would it be possible 
> for you to adopt a more "entity-centric" approach at indexing time? 
> http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
>  
>
>> I know I should perform some testing first, but hope someone may have 
>> similar experience in handling this and could provide some guidance.
>>
>
> The Elasticsearch book has a chapter about "designing for scale" that 
> gives good advice on modeling the data and choosing the right shard 
> size and number of shards: 
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scale.html
>
> -- 
> Adrien Grand
>  
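
The application-level join from the quoted message above amounts to intersecting customer IDs from two separate searches. A minimal sketch, assuming each search has already returned a list of hits carrying a `customer_id` field (the function name and the hit lists are made-up for illustration, not real results):

```python
def application_join(browse_hits, purchase_hits):
    """Return customer IDs present in both result sets (set intersection)."""
    browsers = {hit["customer_id"] for hit in browse_hits}
    buyers = {hit["customer_id"] for hit in purchase_hits}
    return sorted(browsers & buyers)

# Made-up sample hits standing in for the results of two separate searches.
browse_hits = [{"customer_id": 123}, {"customer_id": 456}, {"customer_id": 123}]
purchase_hits = [{"customer_id": 456}, {"customer_id": 789}]

joined = application_join(browse_hits, purchase_hits)
```

Both ID sets have to be pulled into the application before they can be intersected, which is the cost the entity-centric layout is meant to avoid.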

