z11373 wrote:
Thanks Billie/Josh! That indeed fixed the issue; the scan now returns
instantly!!

So when we scan the whole table and filter by column family, Accumulo
still has to go through all rows (ordered by key) and check whether each
entry has the specified column family, and in my case, since the families
are intermingled, the data I am looking for could be somewhere in the
middle or at the end of the rfile, am I right?

I did another experiment: if I specify -b and -e, it also returns
instantly (this was before I moved the families to a different locality
group and compacted), which makes sense, because Accumulo can narrow the
scan down to specific ranges and then filter them by column family.
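
For concreteness, here is roughly what those two scans look like through
the Java client API (a sketch only; the table, family, and row names are
placeholders, and the Connector would come from however you normally
connect):

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class CfScanExample {

  // Case 1: whole-table scan filtered by column family (like "scan -t mytable -c cf1").
  // Without a locality group, the tablet servers still read every key-value pair in key
  // order and drop the ones whose family doesn't match, hence the slow scan.
  static void scanWholeTable(Connector conn) throws Exception {
    Scanner s = conn.createScanner("mytable", Authorizations.EMPTY);
    s.fetchColumnFamily(new Text("cf1"));
    for (Entry<Key,Value> e : s) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }

  // Case 2: the same scan bounded by a row range (like passing -b and -e in the shell).
  // Only the tablets/blocks covering the range are read before the family filter applies.
  static void scanRange(Connector conn) throws Exception {
    Scanner s = conn.createScanner("mytable", Authorizations.EMPTY);
    s.setRange(new Range("row_begin", "row_end"));
    s.fetchColumnFamily(new Text("cf1"));
    for (Entry<Key,Value> e : s) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}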

I have another follow-up question: does this mean I have to create a new
locality group for each column family, since I wouldn't know in advance
how big or small the data belonging to that cf will be?

Btw, we shard customers by putting their id in the column family, so we
add a new column family whenever a new customer onboards. I think the case
where we have to scan the table by cf without specifying ranges may be
rare (or perhaps never, except when I run it from the shell), but I am
worried this could become a perf bottleneck if I don't put them in a
separate locality group.

This strikes me as very odd. Sharding is the process of distributing some data set across multiple nodes. The only way this is done in Accumulo is by the row, not the column family. If you want fast point lookups by customer, you'd want this customer ID in the row. If that's a non-starter for some reason, this is a case where you'd want to implement a secondary index (usually as a separate table) that does have the customer ID in the row, which then points to the row+colfam in your "data" table.

e.g. say your data is sharded/hashed/whatever by date.

20151006_1 cust_id_1:attr1 => value
20151006_1 cust_id_1:attr2 => value

You would make a second table which has something like

cust_id_1 : => 20151006_1

Where you have an empty colfam/colqual. There are ways you could also use these extra fields to perform extra filtering.
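
As a rough sketch of the read path with that layout (the table names "index" and "data", and the convention that the index value holds the data-table row, are just assumptions for illustration), the client would first scan the index table for the customer's pointers and then fetch from the data table:

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IndexLookup {

  // Look up a customer via the index table, then pull their entries from the data table.
  static void lookupCustomer(Connector conn, String custId) throws Exception {
    List<String> dataRows = new ArrayList<>();

    // 1. Point lookup on the index table: row = customer id, value = data-table row.
    Scanner idx = conn.createScanner("index", Authorizations.EMPTY);
    idx.setRange(Range.exact(custId));
    for (Entry<Key,Value> e : idx) {
      dataRows.add(e.getValue().toString());   // e.g. "20151006_1"
    }

    // 2. Fetch the customer's column family from each referenced row of the data table.
    for (String row : dataRows) {
      Scanner data = conn.createScanner("data", Authorizations.EMPTY);
      data.setRange(Range.exact(row));
      data.fetchColumnFamily(new Text(custId));
      for (Entry<Key,Value> e : data) {
        System.out.println(e.getKey() + " -> " + e.getValue());
      }
    }
  }
}

In practice you would likely hand all of the collected ranges to a BatchScanner in one shot rather than creating a Scanner per row, but the two-step shape is the same.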

Ultimately, locality groups are meant to coarsely group "types of data" together rather than to provide quick random access over an entire dataset. Does that make sense?

Another question: when running the setgroups command, it looks like I have
to list all of the column families, even if I just added a new cf. For
example, let's say I did:
setgroups mygroup=cf1,cf2 -t mytable
compact -t mytable -w

Then later, when I need to add cf3 to the same group, I have to do
"setgroups mygroup=cf1,cf2,cf3 -t mytable" instead of just "setgroups
mygroup=cf3 -t mytable".

It'd be nice if I could just do the latter :-) And what happens to cf1 and
cf2 if I do the latter, do they go back to the default group again after
compaction?
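
In the meantime, I'm guessing I can get the "append" behavior through the
Java API by reading the current groups, merging in the new family, and
writing the whole map back, something like this sketch (table and group
names made up):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.admin.TableOperations;
import org.apache.hadoop.io.Text;

public class AddFamilyToGroup {

  // Read the existing locality groups, add one family to a group, and write the
  // full map back, so the families already in the group are not dropped.
  static void addToGroup(Connector conn, String table, String group, String family)
      throws Exception {
    TableOperations ops = conn.tableOperations();

    Map<String,Set<Text>> groups = new HashMap<>(ops.getLocalityGroups(table));
    Set<Text> families = groups.get(group);
    families = (families == null) ? new HashSet<>() : new HashSet<>(families);
    families.add(new Text(family));
    groups.put(group, families);

    ops.setLocalityGroups(table, groups);
    // New data follows the new groups; a compaction rewrites the existing files.
    ops.compact(table, null, null, true, true);
  }
}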


Thanks,
Z



