Adrien,
Is there a clearer version of the video recording? I can barely see the
slides, and I don't quite get the idea of entity-centric indexing.
Does it mean that, for my use case, I maintain a single document per user
that contains lists of activities, and at index time simply update the
relevant list?
Something like:
{
  "_source": {
    "customer_id": 123,
    "browse": [{"item": "item1", "time": "time1"}, {"item": "item2", "time": "time2"}],
    "purchase": [{"item": "item1", "time": "time1"}, {"item": "item2", "time": "time2"}]
  }
}
So at index time I just update the browse/purchase list?
Then my query basically becomes flat (no parent/child lookups).
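
To make sure I am describing the same thing, here is a minimal Python sketch
of what I have in mind. It is in-memory only and makes no Elasticsearch calls;
the function names (record_event, did, browsed_and_purchased), the item/time
field names, and the sample data are all made up for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the index: customer_id -> customer document.
customers = {}

def record_event(customer_id, kind, item, time):
    """Append one event to the customer's 'browse' or 'purchase' list,
    creating the customer document the first time the customer is seen."""
    doc = customers.setdefault(
        customer_id, {"customer_id": customer_id, "browse": [], "purchase": []}
    )
    doc[kind].append({"item": item, "time": time})

def did(doc, kind, item, since):
    """True if the document has a `kind` event for `item` at or after `since`."""
    return any(e["item"] == item and e["time"] >= since for e in doc[kind])

def browsed_and_purchased(browsed_item, browse_since, purchased_item, purchase_since):
    """Flat query: each customer document is checked on its own, with no join."""
    return sorted(
        cid for cid, doc in customers.items()
        if did(doc, "browse", browsed_item, browse_since)
        and did(doc, "purchase", purchased_item, purchase_since)
    )

# Made-up sample data.
now = datetime(2014, 12, 21)
record_event(123, "browse", "A", now - timedelta(days=30))
record_event(123, "purchase", "B", now - timedelta(days=10))
record_event(456, "browse", "A", now - timedelta(days=400))  # outside the window
record_event(456, "purchase", "B", now - timedelta(days=5))

# Browsed A in the last 90 days and purchased B in the last 60 days.
matches = browsed_and_purchased(
    "A", now - timedelta(days=90), "B", now - timedelta(days=60)
)
print(matches)
```

In Elasticsearch I assume record_event would turn into an update (with upsert)
on the customer document, but I have not tried that yet.
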
Is my understanding correct?
Chen
On Sunday, December 21, 2014 at 1:54:48 PM UTC-8, Adrien Grand wrote:
>
>
>
> On Sat, Dec 20, 2014 at 12:53 AM, Chen Wang wrote:
>
>> Hey guys,
>> I would like your suggestions on index design for web activity data.
>> Let's say I have browse data, online purchase data, and store purchase
>> data, and I need to keep a year of each.
>> A year of browse data is around 80G, online purchase data is
>> around 50G, and offline (store) data is around 1T.
>>
>> I have to run queries like: find all the customers who browsed item A
>> in the past X months and also purchased item B online in the past Y months.
>> Originally I used a complicated parent/child structure, which
>> sometimes results in very bad performance. I store all browse,
>> online purchase, and store purchase data in one index spread over 7 shards.
>>
>
> Parent/child is indeed slow. Can you somehow denormalize your data to make
> queries faster?
>
>
>> I have 7 machines with 128G each, and 1T hard disk.
>>
>> Now I am trying to save each type of data into its own index,
>> say browse_v1, onlinepurchase_v1, storepurchase_v1. Since it's time-based
>> data, how should I decide whether to split it into monthly or yearly indexes?
>> For browse (70G) and online purchase (50G), I think I can just use one index
>> with one shard each. Or should I break them into monthly indexes instead?
>> Monthly indexes give me the flexibility of adding/removing
>> data, but they will also decrease query performance, right? (A search
>> against 1 index now becomes a search against 12 indexes.)
>>
>> For store data (1T), apparently I have to break it into at least monthly
>> indexes, but each monthly index still contains around 100G of data. With my
>> current cluster, how many shards should I allocate to each monthly index? I
>> am also concerned about the query performance.
>>
>> Since I am now storing them in separate indexes, to achieve the
>> query I want I will need to do an application-level join. Is this the common
>> way to handle such a use case?
>>
>
> As much as possible, you should try to design your documents in such a way
> that you don't need to perform joins at search time. Would it be possible
> for you to adopt a more "entity-centric" approach at indexing time?
> http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
>
>
>> I know I should perform some testing first, but hope someone may have
>> similar experience in handling this and could provide some guidance.
>>
>
> The Elasticsearch book has a chapter about "designing for scale" that
> gives good advice on modeling the data and choosing the right shard
> size and number of shards:
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scale.html
>
> --
> Adrien Grand
>