[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000768#comment-15000768 ]

Aleksey Yeschenko commented on CASSANDRA-6091:
----------------------------------------------

[~michaelsembwever] I don't know why it was moved to {{Testing}}, and I don't know if it's still relevant. Sorry for the annoying delay.

At this point it will most likely not go into 2.1.x and 2.2.x (or 3.0.x), but, if still relevant for 3.x, might go into 3.2. Can you have a look/cook a proper patch, if so?

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Alex Liu
> Assignee: mck
> Priority: Minor
> Attachments: cassandra-2.0-6091.txt, cassandra-2.1-6091.txt, trunk-6091.txt
>
>
> CASSANDRA-6084 shows there are some issues when running hadoop/pig jobs if
> vnodes are enabled. Also, the hadoop performance of vnode-enabled nodes is
> bad because there are so many splits.
> The idea is to combine vnode splits into big pseudo splits so that it works
> as if vnodes were disabled for hadoop/pig jobs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608706#comment-14608706 ]

mck commented on CASSANDRA-6091:
--------------------------------

this has been in Testing for over 3 months. wazzup?

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Reporter: Alex Liu
> Assignee: mck
> Attachments: cassandra-2.0-6091.txt, cassandra-2.1-6091.txt, trunk-6091.txt
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359104#comment-14359104 ]

mck commented on CASSANDRA-6091:
--------------------------------

New patches are on their way for both 2.0 and 2.1.
(There's a silly NPE in CqlRecordReader in the first patch so I've removed it, but I don't know how to transition the issue back to "in progress" or "opened" status.)

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Reporter: Alex Liu
> Assignee: mck
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349029#comment-14349029 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

Yes, please add cassandra-2.1 as well. If we can't cleanly merge it into trunk, we need another one for trunk.

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Reporter: Alex Liu
> Assignee: mck
> Attachments: cassandra-2.0-6091.txt
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348865#comment-14348865 ]

mck commented on CASSANDRA-6091:
--------------------------------

oh …LGTM :-)
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348468#comment-14348468 ]

mck commented on CASSANDRA-6091:
--------------------------------

no idea what LGMT means :-/
do you need cassandra-2.1 and trunk patches as well?
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347173#comment-14347173 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

LGMT+1
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330467#comment-14330467 ]

mck commented on CASSANDRA-6091:
--------------------------------

patch submitted.

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Reporter: Alex Liu
> Assignee: Alex Liu
> Attachments: cassandra-2.0-6091.txt
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316544#comment-14316544 ]

Jeremy Hanna commented on CASSANDRA-6091:
-----------------------------------------

[~michaelsembwever] can you give a high-level description of the approach you're taking?

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Reporter: Alex Liu
> Assignee: Alex Liu
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316667#comment-14316667 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

We need to at least check the total estimated rows of the multiple token ranges per split, instead of the multiple token ranges per node.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316635#comment-14316635 ]

Piotr Kołaczkowski commented on CASSANDRA-6091:
-----------------------------------------------

The way we dealt with this problem in the Spark connector was to allow multiple token ranges per split. I don't think there is any other way, as the number of adjacent token ranges is going to drop very quickly with the size of the cluster.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317090#comment-14317090 ]

mck commented on CASSANDRA-6091:
--------------------------------

The approach in the patch is to allow multiple token ranges per split. We do this with our custom input formats, and it is (very) effective in that it means splitSize is honoured.

Handling multiple token ranges per split requires, for example, the code change found in CqlRecordReader whereby the reader must iterate over both rows and tokenRanges.

The grouping of token ranges by common location sets, so that splits again honour the splitSize, happens in AbstractColumnFamilyInputFormat.collectSplits(..)

Token ranges do not need to be adjacent.
Everything in this patch is done client-side.
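The grouping step described in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code from the attached patches: the class `SplitPackingSketch`, the type `SubRange` and the method `packIntoSplits` are hypothetical names, and the per-range row estimates are assumed to be available already. Ranges that share one location set are packed greedily until the estimated row total reaches splitSize.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these are not the classes or methods from the patch.
final class SplitPackingSketch {

    /** One token range together with its estimated row count. */
    static final class SubRange {
        final String startToken;
        final String endToken;
        final long estimatedRows;

        SubRange(String startToken, String endToken, long estimatedRows) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.estimatedRows = estimatedRows;
        }
    }

    /**
     * Packs ranges that share one location set into splits. A split is closed
     * as soon as its estimated rows reach splitSize; the ranges inside a split
     * do not have to be adjacent on the ring.
     */
    static List<List<SubRange>> packIntoSplits(List<SubRange> sameLocationRanges, long splitSize) {
        List<List<SubRange>> splits = new ArrayList<>();
        List<SubRange> current = new ArrayList<>();
        long rowsInCurrent = 0;
        for (SubRange range : sameLocationRanges) {
            current.add(range);
            rowsInCurrent += range.estimatedRows;
            if (rowsInCurrent >= splitSize) {
                splits.add(current);
                current = new ArrayList<>();
                rowsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // leftover ranges form a final, smaller split
        }
        return splits;
    }

    public static void main(String[] args) {
        List<SubRange> ranges = Arrays.asList(
            new SubRange("t0", "t1", 300), new SubRange("t4", "t5", 300),
            new SubRange("t8", "t9", 500), new SubRange("t12", "t13", 100));
        for (List<SubRange> split : packIntoSplits(ranges, 1000)) {
            System.out.println("split with " + split.size() + " token range(s)");
        }
    }
}
```

A greedy pass like this can overshoot splitSize by up to one range's worth of rows, which is a small error when vnode ranges are much smaller than splitSize.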
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317223#comment-14317223 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

One more improvement is combining adjacent token ranges into one range. But it will create some small corner ranges, so the trade-off is not that good, as the number of adjacent token ranges is going to drop very quickly with the size of the cluster.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317105#comment-14317105 ]

mck commented on CASSANDRA-6091:
--------------------------------

With the handling of multiple token ranges, we lose a little accuracy in the resulting splitSize because token ranges are so much smaller (e.g. from the corner splits). But apart from such errors evening out, isn't the more important goal here to have splits sized more consistently to splitSize, so that users can tune and achieve a reasonably steady throughput of tasks?
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223556#comment-14223556 ]

mck commented on CASSANDRA-6091:
--------------------------------

[~jbellis], [~alexliu68] any thoughts on that last patch?
I'm pretty keen to wrap it up w/ CFIF+CFRR and submit a proper patch for it all.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168561#comment-14168561 ]

mck commented on CASSANDRA-6091:
--------------------------------

I guess this^ approach falls apart with an increasing number of nodes in a cluster (the chance of adjacent splits with the same dataNodes drops quickly), and it comes back to splits with multiple token ranges and CRR supporting that.

But I still don't get why you *need* to have any thrift/CQL server-side change (at least to begin with)?
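For a record reader to support splits with multiple token ranges, it has to move on to the next range whenever the rows of the current one are exhausted. The sketch below shows that iteration pattern in isolation. It is an assumption-laden illustration, not the reader change from any patch on this ticket: `MultiRangeRowIterator` and the `rowsForRange` callback (standing in for one paged query per token range) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch; this is not the CqlRecordReader change from the patch.
final class MultiRangeRowIterator<T, R> implements Iterator<R> {
    private final Iterator<T> tokenRanges;
    private final Function<T, Iterator<R>> rowsForRange; // e.g. one paged query per range
    private Iterator<R> currentRows = Collections.emptyIterator();

    MultiRangeRowIterator(List<T> tokenRanges, Function<T, Iterator<R>> rowsForRange) {
        this.tokenRanges = tokenRanges.iterator();
        this.rowsForRange = rowsForRange;
    }

    @Override
    public boolean hasNext() {
        // Move on to the next token range whenever the current one is exhausted.
        while (!currentRows.hasNext() && tokenRanges.hasNext()) {
            currentRows = rowsForRange.apply(tokenRanges.next());
        }
        return currentRows.hasNext();
    }

    @Override
    public R next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentRows.next();
    }

    public static void main(String[] args) {
        Iterator<String> rows = new MultiRangeRowIterator<String, String>(
            Arrays.asList("range-a", "range-b"),
            range -> Arrays.asList(range + "/row1", range + "/row2").iterator());
        while (rows.hasNext()) {
            System.out.println(rows.next());
        }
    }
}
```

Fetching rows lazily per range means empty token ranges cost one query each but never surface as empty results to the caller.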
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168384#comment-14168384 ]

mck commented on CASSANDRA-6091:
--------------------------------

[~alexliu68] Any reason we can't (at least to begin with, since this is a right headache) join the splits after the call to describe_splits_ex, as is done in this [patch|https://github.com/michaelsembwever/cassandra/pull/1/files]?

Although there is one thing about this patch that I haven't yet understood: when I test against a single-node cluster I expect rejoinAdjacentSplits(..) to return a list of one split, token range -8940796744825771419 to 6948181744525544, but instead I get that plus two more splits, three splits in total, that look like
 1) -8940796744825771419 to 6948181744525544
 2) -1 to -8940796744825771419
 3) 6948181744525544 to -8940796744825771419

This is highlighted by the assert statement commented out at ACFIF line 253.

Am I doing something wrong, or is describe_local_ring?
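As a point of comparison, joining adjacent ranges client-side can be sketched as below. This is a hypothetical illustration and not the rejoinAdjacentSplits(..) implementation from the linked patch: `AdjacentJoinSketch`, `Range` and `joinAdjacent` are made-up names, every input range is assumed to have the same replica locations, and ranges that wrap around the end of the ring are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not the rejoinAdjacentSplits(..) code from the patch.
final class AdjacentJoinSketch {

    /** A non-wrapping token range from start to end. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Joins ranges where one range ends exactly where the next begins.
     * Assumes all ranges share the same locations and none wraps the ring.
     */
    static List<Range> joinAdjacent(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((Range r) -> r.start));
        List<Range> joined = new ArrayList<>();
        for (Range range : sorted) {
            int last = joined.size() - 1;
            if (last >= 0 && joined.get(last).end == range.start) {
                // Extend the previous range instead of starting a new one.
                joined.set(last, new Range(joined.get(last).start, range.end));
            } else {
                joined.add(range);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
            new Range(10, 20), new Range(30, 40), new Range(0, 10));
        for (Range r : joinAdjacent(ranges)) {
            System.out.println(r.start + " to " + r.end);
        }
    }
}
```

Because wrap-around ranges are excluded here, this sketch says nothing about why the single-node test above returns three splits.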
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798964#comment-13798964 ]

Cyril Scetbon commented on CASSANDRA-6091:
------------------------------------------

[~alexliu68] Why are you stopping this implementation? Do we have a guarantee that this feature could not help a lot even when we have a lot of data? I'll do some tests with and without vnodes.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798976#comment-13798976 ]

Jeremy Hanna commented on CASSANDRA-6091:
-----------------------------------------

I think a factor that we've overlooked is data locality. With smaller ranges and the same input split size, there's a higher chance that the split will be outside of a single virtual token range. I have observed that in the job counters with vnodes enabled, only about a third of the tasks are data local. That would probably need some testing. The user was doing some tests with input split size. In any case, if this is borne out in testing, it is the bigger problem.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799088#comment-13799088 ]

Cyril Scetbon commented on CASSANDRA-6091:
------------------------------------------

That's a good reason to continue working on that ticket :)
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799246#comment-13799246 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

Let's wait for the testing results on larger data when vnodes are enabled. For now, the thread pool limit should resolve the issue with generating splits. If the data is big enough, there will be multiple splits for each vnode, so it doesn't help to have merged ranges and pseudo splits.

One potential issue with vnodes is that there could be many small corner splits (the last split of each vnode). E.g. with 256 vnodes per node, we could end up with around 256 small corner splits. If we disable vnodes, those small corner splits will be merged into bigger splits.

As for data locality, we need more investigation. It's related to the number of splits, the number of tasks run on each node, and how busy each node is.

If the testing results show it's a bigger issue than we expected, I will implement the merge approach.
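The corner-split arithmetic in this comment can be made concrete with a small calculation. The sketch below is illustrative only: `CornerSplitSketch` and its methods are hypothetical, rows are assumed to be spread evenly over the vnodes, and the row counts and splitSize in `main` are made-up numbers rather than measurements.

```java
// Hypothetical sketch; the numbers in main are illustrative, not measured.
final class CornerSplitSketch {

    /** Splits needed for one range holding `rows` rows: ceil(rows / splitSize). */
    static long splitsFor(long rows, long splitSize) {
        return (rows + splitSize - 1) / splitSize;
    }

    /** Total splits when each range is split on its own. */
    static long totalSplits(long[] rowsPerRange, long splitSize) {
        long total = 0;
        for (long rows : rowsPerRange) {
            total += splitsFor(rows, splitSize);
        }
        return total;
    }

    /** Ranges whose last split is smaller than splitSize (a "corner" split). */
    static long cornerSplits(long[] rowsPerRange, long splitSize) {
        long corners = 0;
        for (long rows : rowsPerRange) {
            if (rows % splitSize != 0) {
                corners++;
            }
        }
        return corners;
    }

    public static void main(String[] args) {
        long splitSize = 65_536;
        long[] vnodes = new long[256];
        java.util.Arrays.fill(vnodes, 150_000L); // 256 vnodes, same rows each
        long[] merged = { 256 * 150_000L };      // same data as one range

        System.out.println("vnodes: " + totalSplits(vnodes, splitSize)
            + " splits, " + cornerSplits(vnodes, splitSize) + " corner splits");
        System.out.println("merged: " + totalSplits(merged, splitSize)
            + " splits, " + cornerSplits(merged, splitSize) + " corner splits");
    }
}
```

With these illustrative inputs, splitting each of the 256 ranges separately gives 768 splits of which 256 are corner splits, while the same rows treated as one range give 586 splits with a single corner split.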
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799632#comment-13799632 ]

Jonathan Ellis commented on CASSANDRA-6091:
-------------------------------------------

bq. Why are you stopping this implementation

Two words: opportunity cost.

bq. With smaller ranges and the same input split size, there's a higher chance that the split will be outside of a single virtual token range.

Hmm, good point. Let's see how that bears out.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794482#comment-13794482 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

The following code
{code}
List<TokenRange> masterRangeNodes = getRangeMap(conf);
{code}
returns all the token ranges. We need to find a way to merge the token ranges into bigger token ranges while keeping the replica locations unchanged.

Merging token ranges helps reduce the number of splits. The reduction rate depends on how randomly the token ranges are shuffled around the ring. It would help a lot if we could find a better shuffle algorithm to maximize the merging.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794495#comment-13794495 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

If two token ranges have the same list of endpoints, they can be merged into a pseudo token range, e.g.
{code}
Range A [start_a, end_a) has endpoints [node_1, node_5, node_8]
Range B [start_b, end_b) has endpoints [node_1, node_5, node_8]

can be merged to pseudo token range

Pseudo_Range_A_B [[start_a, end_a), [start_b, end_b)]
{code}
We need to modify CqlPagingRecordReader to support splits with multiple token ranges.
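The merge rule in this comment amounts to grouping token ranges by their endpoints. A minimal sketch follows, assuming that endpoint order does not matter (the comment says "list", so that is an interpretation). `PseudoRangeSketch`, `Range` and `mergeByEndpoints` are hypothetical names and not Cassandra classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these types are Cassandra classes.
final class PseudoRangeSketch {

    /** A half-open token range [start, end) plus the nodes holding replicas of it. */
    static final class Range {
        final long start;
        final long end;
        final List<String> endpoints;

        Range(long start, long end, String... endpoints) {
            this.start = start;
            this.end = end;
            this.endpoints = Arrays.asList(endpoints);
        }
    }

    /** Groups ranges by endpoint set; each map value is one pseudo token range. */
    static Map<Set<String>, List<Range>> mergeByEndpoints(List<Range> ranges) {
        Map<Set<String>, List<Range>> pseudoRanges = new LinkedHashMap<>();
        for (Range range : ranges) {
            // A sorted set makes the key independent of endpoint order (an assumption).
            Set<String> key = new TreeSet<>(range.endpoints);
            pseudoRanges.computeIfAbsent(key, k -> new ArrayList<>()).add(range);
        }
        return pseudoRanges;
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
            new Range(0, 100, "node_1", "node_5", "node_8"),
            new Range(100, 200, "node_2", "node_6", "node_9"),
            new Range(500, 600, "node_8", "node_1", "node_5"));
        for (Map.Entry<Set<String>, List<Range>> e : mergeByEndpoints(ranges).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " range(s)");
        }
    }
}
```

If replica order turned out to matter for locality, the key could be the endpoint list itself instead of a sorted set, at the cost of fewer merges.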
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794509#comment-13794509 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

We need to add a new thrift method to calculate the splits for multiple token ranges, as follows
{code}
describe_splits(String cfName, List<Pair<String, String>> tokenRanges, int keys_per_split)
{code}
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794514#comment-13794514 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

[~jbellis] Any comments before I implement it?
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794520#comment-13794520 ]

Jonathan Ellis commented on CASSANDRA-6091:
-------------------------------------------

# Better to implement it in CQL so we're not adding Thrift dependencies.
# I still think this is low priority since for non-toy datasets, the number of splits will outnumber the vnode count anyway.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794532#comment-13794532 ]

Cyril Scetbon commented on CASSANDRA-6091:
------------------------------------------

bq. I still think this is low priority since for non-toy datasets, there will be multiple splits per vnode anyway.

It's not only a performance issue, as reported by [CASSANDRA-6084|https://issues.apache.org/jira/browse/CASSANDRA-6084]. IMHO, hitting an issue only because it spawns as many connections as the number of vnodes held by the host is not a low priority problem.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794570#comment-13794570 ]

Jonathan Ellis commented on CASSANDRA-6091:
-------------------------------------------

bq. it spawns as much connections as the number of vnodes

Already addressed by CASSANDRA-6169
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794589#comment-13794589 ]

Cyril Scetbon commented on CASSANDRA-6091:
------------------------------------------

bq. Already addressed by CASSANDRA-6169

Right, I saw it last week :)
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794651#comment-13794651 ]

Alex Liu commented on CASSANDRA-6091:
-------------------------------------

Another solution is to get all the splits for multiple ranges of the same node in a single request, instead of multiple requests with one request per range.

I am holding off on implementing this feature unless we want it later.
[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig
[ https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790734#comment-13790734 ]

Robert Coli commented on CASSANDRA-6091:
----------------------------------------

CASSANDRA-6169 is semi-related; it limits the number of threads in the thread pool to be less than the number of splits.