[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-11-11 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000768#comment-15000768
 ] 

Aleksey Yeschenko commented on CASSANDRA-6091:
--

[~michaelsembwever] I don't know why it was moved to {{Testing}}, and I don't 
know if it's still relevant. Sorry for the annoying delay.

At this point it will most likely not go into 2.1.x and 2.2.x (or 3.0.x), but, 
if still relevant for 3.x, might go into 3.2. Can you have a look/cook a proper 
patch, if so?

> Better Vnode support in hadoop/pig
> ----------------------------------
>
> Key: CASSANDRA-6091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6091
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Alex Liu
> Assignee: mck
> Priority: Minor
> Attachments: cassandra-2.0-6091.txt, cassandra-2.1-6091.txt, 
> trunk-6091.txt
>
>
> CASSANDRA-6084 shows that there are some issues when running hadoop/pig jobs if 
> vnodes are enabled. Hadoop performance on vnode-enabled nodes is also poor 
> because there are so many splits.
> The idea is to combine vnode splits into big pseudo splits so that the 
> hadoop/pig job works as if vnodes were disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-06-30 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608706#comment-14608706
 ] 

mck commented on CASSANDRA-6091:


this has been in Testing for over 3 months. wazzup?



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-03-12 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359104#comment-14359104
 ] 

mck commented on CASSANDRA-6091:


new patches are on their way for both 2.0 and 2.1. 

(there's a silly NPE in CqlRecordReader in the first patch so i've removed it, 
but i don't know how to transition the issue back to In Progress or Open 
status).



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-03-05 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349029#comment-14349029
 ] 

Alex Liu commented on CASSANDRA-6091:
-

yes, please add cassandra-2.1 as well. If we can't cleanly merge it into trunk, 
we need another one for trunk.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-03-05 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348865#comment-14348865
 ] 

mck commented on CASSANDRA-6091:


oh …LGTM :-)



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-03-05 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348468#comment-14348468
 ] 

mck commented on CASSANDRA-6091:


no idea what LGMT means :-/

do you need cassandra-2.1 and trunk patches as well?



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-03-04 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347173#comment-14347173
 ] 

Alex Liu commented on CASSANDRA-6091:
-

LGMT+1



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-21 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330467#comment-14330467
 ] 

mck commented on CASSANDRA-6091:


patch submitted.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-11 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316544#comment-14316544
 ] 

Jeremy Hanna commented on CASSANDRA-6091:
-

[~michaelsembwever] can you give a high level description of the approach 
you're taking?



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-11 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316667#comment-14316667
 ] 

Alex Liu commented on CASSANDRA-6091:
-

We need to at least check the total estimated rows of the multiple token ranges 
per split, instead of the multiple token ranges per node.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-11 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316635#comment-14316635
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6091:
---

The way we dealt with this problem in the Spark connector was to allow multiple 
token ranges per split. I don't think there is any other way, as the number of 
adjacent token ranges is going to drop very quickly with the size of the 
cluster.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-11 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317090#comment-14317090
 ] 

mck commented on CASSANDRA-6091:


The approach in the patch is to allow multiple token ranges per split.
We do this with our custom input formats, and it is (very) effective in that it 
means splitSize is honoured.

Handling multiple token ranges per split requires, for example, the code change 
found in CqlRecordReader, whereby the reader must iterate over both rows and 
tokenRanges.

The grouping of token ranges by common location sets, so that splits again 
honour the splitSize, happens in 
AbstractColumnFamilyInputFormat.collectSplits(..)

Token ranges do not need to be adjacent.
Everything in this patch is done client-side.
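
For readers following along, the grouping step described above can be sketched roughly as follows. This is not the code from the patch: the class and field names (SplitGrouping, TokenRange, MultiRangeSplit, estimatedRows) are hypothetical, and the sketch assumes a per-range row estimate is already available. It only illustrates grouping ranges by location set and filling each split up to splitSize.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SplitGrouping {

    /** Hypothetical token range: bounds, replica locations and an estimated row count. */
    public static final class TokenRange {
        public final long start;
        public final long end;
        public final Set<String> endpoints;
        public final long estimatedRows;

        public TokenRange(long start, long end, Set<String> endpoints, long estimatedRows) {
            this.start = start;
            this.end = end;
            this.endpoints = endpoints;
            this.estimatedRows = estimatedRows;
        }
    }

    /** Hypothetical split: several (not necessarily adjacent) ranges sharing one location set. */
    public static final class MultiRangeSplit {
        public final Set<String> endpoints;
        public final List<TokenRange> ranges = new ArrayList<>();
        public long estimatedRows = 0;

        MultiRangeSplit(Set<String> endpoints) {
            this.endpoints = endpoints;
        }
    }

    public static List<MultiRangeSplit> collectSplits(List<TokenRange> ranges, long splitSize) {
        // Step 1: group ranges by location set. Set equality ignores element order.
        Map<Set<String>, List<TokenRange>> byLocation = new LinkedHashMap<>();
        for (TokenRange r : ranges) {
            byLocation.computeIfAbsent(r.endpoints, k -> new ArrayList<>()).add(r);
        }

        // Step 2: within each group, fill a split until the next range would exceed splitSize.
        List<MultiRangeSplit> splits = new ArrayList<>();
        for (Map.Entry<Set<String>, List<TokenRange>> e : byLocation.entrySet()) {
            MultiRangeSplit current = new MultiRangeSplit(e.getKey());
            for (TokenRange r : e.getValue()) {
                if (!current.ranges.isEmpty() && current.estimatedRows + r.estimatedRows > splitSize) {
                    splits.add(current);
                    current = new MultiRangeSplit(e.getKey());
                }
                current.ranges.add(r);
                current.estimatedRows += r.estimatedRows;
            }
            if (!current.ranges.isEmpty()) {
                splits.add(current);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("node_1", "node_5", "node_8"));
        Set<String> b = new HashSet<>(Arrays.asList("node_2", "node_6", "node_9"));
        List<TokenRange> ranges = Arrays.asList(
            new TokenRange(0, 10, a, 400),
            new TokenRange(50, 60, b, 300),
            new TokenRange(20, 30, a, 500),
            new TokenRange(70, 80, a, 300));
        for (MultiRangeSplit s : collectSplits(ranges, 1000)) {
            System.out.println(s.ranges.size() + " range(s), ~" + s.estimatedRows + " rows");
        }
    }
}
```

With the sample data in main and a splitSize of 1000, the three ranges on the first replica set become two splits (about 900 and 300 rows) and the single range on the second replica set becomes a third. A range larger than splitSize still ends up in a split of its own here; sub-splitting such a range is outside this sketch.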





[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-11 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317223#comment-14317223
 ] 

Alex Liu commented on CASSANDRA-6091:
-

One more improvement is combining adjacent token ranges into one range. But it 
will create some small corner ranges, so the trade-off is not that good, as 
the number of adjacent token ranges is going to drop very quickly with the size 
of the cluster.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2015-02-11 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317105#comment-14317105
 ] 

mck commented on CASSANDRA-6091:


With handling of multiple token ranges, we lose a little accuracy in the 
resulting splitSize because token ranges are so much smaller (eg from the 
corner splits), but apart from such errors evening out, isn't the more 
important goal here to have splits sized more consistently with splitSize so 
that users can tune and achieve a reasonably steady throughput of tasks?



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2014-11-24 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223556#comment-14223556
 ] 

mck commented on CASSANDRA-6091:


[~jbellis], [~alexliu68] any thoughts on that last patch? I'm pretty keen to 
wrap it up w/ CFIF+CFRR and submit a proper patch for it all.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2014-10-12 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168561#comment-14168561
 ] 

mck commented on CASSANDRA-6091:


I guess this^ approach falls apart as the number of nodes in a cluster 
increases (the chance of adjacent splits sharing the same dataNodes drops 
quickly), and it comes back to splits with multiple token ranges and CRR 
supporting that. 
But i still don't get why you *need* to have any thrift/CQL server-side change 
(at least to begin with)? 



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2014-10-11 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168384#comment-14168384
 ] 

mck commented on CASSANDRA-6091:


[~alexliu68] Any reason we can't (at least to begin with since this is a right 
headache) join the splits after the call to describe_splits_ex, as is done in 
this [patch|https://github.com/michaelsembwever/cassandra/pull/1/files]?

One thing about this patch that i haven't yet understood: when i test 
against a single-node cluster i expect rejoinAdjacentSplits(..) to return a 
list of one split, token range -8940796744825771419 to 6948181744525544, 
but instead i get that plus two more splits, three splits in total, that look 
like

 1) -8940796744825771419 to 6948181744525544
 2) -1 to -8940796744825771419
 3) 6948181744525544 to -8940796744825771419

This is highlighted by the assert statement commented out in ACFIF line 253.
Am i doing something wrong or is describe_local_ring?
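
As background for the question above, joining adjacent splits can be sketched as below. This is not the code from the linked patch: the class AdjacentRanges, its Range type and the rejoinAdjacent method are hypothetical, and the sketch assumes the input is already in ring order. It treats any range whose start is not below its end as a wrap-around range and leaves it unjoined. Under that convention, splits 2 and 3 in the list above are wrap-around ranges, since their start tokens are greater than their end tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AdjacentRanges {

    /** Hypothetical token range (start exclusive, end inclusive) with its replica locations. */
    public static final class Range {
        public final long start;
        public final long end;
        public final Set<String> endpoints;

        public Range(long start, long end, Set<String> endpoints) {
            this.start = start;
            this.end = end;
            this.endpoints = endpoints;
        }

        /** Sketch convention: a range whose start is not below its end wraps around the ring. */
        public boolean wraps() {
            return start >= end;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + end + "]" + (wraps() ? " wraps" : "");
        }
    }

    /**
     * Joins neighbours where one range ends exactly where the next one starts and
     * both have the same endpoints. Wrap-around ranges are passed through untouched.
     */
    public static List<Range> rejoinAdjacent(List<Range> inRingOrder) {
        List<Range> joined = new ArrayList<>();
        for (Range next : inRingOrder) {
            Range last = joined.isEmpty() ? null : joined.get(joined.size() - 1);
            boolean mergeable = last != null
                && !last.wraps()
                && !next.wraps()
                && last.end == next.start
                && last.endpoints.equals(next.endpoints);
            if (mergeable) {
                // Replace the previous range with one that spans both.
                joined.set(joined.size() - 1, new Range(last.start, next.end, last.endpoints));
            } else {
                joined.add(next);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("node_1"));
        Set<String> b = new HashSet<>(Arrays.asList("node_2"));
        List<Range> ring = Arrays.asList(
            new Range(0, 10, a),
            new Range(10, 20, a),
            new Range(20, 30, b),
            new Range(30, 0, b));
        for (Range r : rejoinAdjacent(ring)) {
            System.out.println(r);
        }
    }
}
```

In the sample ring the first two ranges are joined, the third has different endpoints and stays separate, and the last one wraps and is therefore left alone.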
 



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-18 Thread Cyril Scetbon (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798964#comment-13798964
 ] 

Cyril Scetbon commented on CASSANDRA-6091:
--

[~alexliu68] Why are you stopping this implementation? Do we have a guarantee 
that this feature could not help a lot even when we have a lot of data? I'll do 
some tests with and without vnodes.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-18 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798976#comment-13798976
 ] 

Jeremy Hanna commented on CASSANDRA-6091:
-

I think a factor that we've overlooked is data locality.  With smaller ranges 
and the same input split size, there's a higher chance that the split will be 
outside of a single virtual token range.  I have observed that in the job 
counters with vnodes enabled, only about a third of the tasks are data local.  
That would probably need some testing.  The user was doing some tests with 
input split size.

In any case if this is borne out in testing, it is the bigger problem.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-18 Thread Cyril Scetbon (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799088#comment-13799088
 ] 

Cyril Scetbon commented on CASSANDRA-6091:
--

That's a good reason to continue working on that ticket :)



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-18 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799246#comment-13799246
 ] 

Alex Liu commented on CASSANDRA-6091:
-

Let's wait for the test results on the larger data set with vnodes enabled. 
For now, the thread pool limit should resolve the issue with generating 
splits. If the data is big enough, there will be multiple splits for each vnode, 
so merging ranges into pseudo splits doesn't help. One potential issue 
with vnodes is that there could be many small corner splits (the last 
split for each vnode). E.g. with 256 vnodes per node, we could potentially end 
up with around 256 small corner splits. If we disable vnodes, those small corner 
splits will be merged into bigger splits. 

As for data locality, we need more investigation. It's related to the number of 
splits, the number of tasks run on each node, and how busy each node is. 

If the test results show it's a bigger issue than we expected, I will 
implement the merge approach.





[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-18 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799632#comment-13799632
 ] 

Jonathan Ellis commented on CASSANDRA-6091:
---

bq. Why are you stopping this implementation

Two words: opportunity cost.

bq.  With smaller ranges and the same input split size, there's a higher chance 
that the split will be outside of a single virtual token range.

Hmm, good point.  Let's see how that bears out.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794482#comment-13794482
 ] 

Alex Liu commented on CASSANDRA-6091:
-

The following code
{code}
   List<TokenRange> masterRangeNodes = getRangeMap(conf);
{code}
returns all the token ranges. We need to find a way to merge the token ranges 
into bigger token ranges while keeping the replica locations unchanged.

Merging token ranges helps reduce the number of splits. The reduction rate 
depends on how randomly the token ranges are shuffled around the ring. It would 
help a lot if we could find a better shuffle algorithm to maximise the merging.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794495#comment-13794495
 ] 

Alex Liu commented on CASSANDRA-6091:
-

If two token ranges have the same list of endpoints, they can be merged into a 
pseudo token range. e.g.

{code}
   Range A [start_a, end_a)  has endpoints[node_1, node_5, node_8]
   Range B [start_b, end_b)  has endpoints[node_1, node_5, node_8]
  
   can be merged to 
   pseudo token range Pseudo_Range_A_B [[start_a, end_a), [start_b, end_b)]
 
{code}

We need to modify CqlPagingRecordReader to support splits with multiple token 
ranges.
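
To make the reader-side change more concrete, here is a minimal sketch of a row iterator that walks every token range in a split in turn. It is not CqlPagingRecordReader code: MultiRangeRowIterator and the fetchRows function are hypothetical stand-ins, with fetchRows assumed to run one paged query per token range and return that range's rows.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Iterates over the rows of several token ranges as if they were one range. */
public class MultiRangeRowIterator<R, T> implements Iterator<T> {

    private final Iterator<R> ranges;
    // Hypothetical hook: fetches the rows of a single token range.
    private final Function<R, Iterator<T>> fetchRows;
    private Iterator<T> currentRows = null;

    public MultiRangeRowIterator(List<R> ranges, Function<R, Iterator<T>> fetchRows) {
        this.ranges = ranges.iterator();
        this.fetchRows = fetchRows;
    }

    @Override
    public boolean hasNext() {
        // Move on to the next range until one yields a row or no ranges are left.
        while (currentRows == null || !currentRows.hasNext()) {
            if (!ranges.hasNext()) {
                return false;
            }
            currentRows = fetchRows.apply(ranges.next());
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentRows.next();
    }

    public static void main(String[] args) {
        // Each inner list stands in for the rows of one token range; the second is empty.
        List<List<String>> ranges = Arrays.asList(
            Arrays.asList("row1", "row2"),
            Collections.<String>emptyList(),
            Arrays.asList("row3"));
        Iterator<String> rows =
            new MultiRangeRowIterator<List<String>, String>(ranges, List::iterator);
        while (rows.hasNext()) {
            System.out.println(rows.next());
        }
    }
}
```

The point of the design is that the caller sees a single stream of rows, so empty ranges and range boundaries never surface outside the iterator.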



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794509#comment-13794509
 ] 

Alex Liu commented on CASSANDRA-6091:
-

We need to add a new Thrift method to calculate the splits for multiple token 
ranges, as follows:
{code}
  describe_splits(String cfName, List<Pair<String, String>> tokenRanges, int 
keys_per_split)
{code}





[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794514#comment-13794514
 ] 

Alex Liu commented on CASSANDRA-6091:
-

[~jbellis] Any comments before I implement it?



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794520#comment-13794520
 ] 

Jonathan Ellis commented on CASSANDRA-6091:
---

# Better to implement it in CQL so we're not adding Thrift dependencies.
# I still think this is low priority since for non-toy datasets, the number of 
splits will outnumber the vnode count anyway



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Cyril Scetbon (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794532#comment-13794532
 ] 

Cyril Scetbon commented on CASSANDRA-6091:
--

bq. I still think this is low priority since for non-toy datasets, there will be 
multiple splits per vnode anyway.
It's not only a performance issue, as reported by 
[CASSANDRA-6084|https://issues.apache.org/jira/browse/CASSANDRA-6084]. IMHO, 
hitting an issue only because it spawns as many connections as the number of 
vnodes held by the host is not a low priority problem.



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794570#comment-13794570
 ] 

Jonathan Ellis commented on CASSANDRA-6091:
---

bq. it spawns as many connections as the number of vnodes

Already addressed by CASSANDRA-6169



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Cyril Scetbon (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794589#comment-13794589
 ] 

Cyril Scetbon commented on CASSANDRA-6091:
--

bq. Already addressed by CASSANDRA-6169
Right, I saw it last week :)



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-14 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794651#comment-13794651
 ] 

Alex Liu commented on CASSANDRA-6091:
-

Another solution is to get all the splits for multiple ranges of the same node 
in a single request, instead of issuing one request per range.

I am holding off on implementing this feature unless we want it later. 
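
A rough sketch of that batching, purely for illustration: {{plan_split_requests}} and {{fetch_all_splits}} are made-up names, and the {{fetch_splits}} callback stands in for whatever call would return the splits for a list of ranges on one node. No such batched call is assumed to exist in Cassandra today.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Range = Tuple[int, int]  # (start_token, end_token), hypothetical representation


def plan_split_requests(
    range_owners: Iterable[Tuple[Range, str]],
) -> Dict[str, List[Range]]:
    """Group token ranges by the node that owns them."""
    batches = defaultdict(list)
    for token_range, node in range_owners:
        batches[node].append(token_range)
    return dict(batches)


def fetch_all_splits(
    range_owners: Iterable[Tuple[Range, str]],
    fetch_splits: Callable[[str, List[Range]], List[Range]],
) -> List[Range]:
    """Issue one request per node instead of one request per range."""
    splits = []
    for node, ranges in plan_split_requests(range_owners).items():
        # fetch_splits is a placeholder for the batched per-node request.
        splits.extend(fetch_splits(node, ranges))
    return splits
```

With this shape the number of requests is bounded by the number of nodes rather than by the number of vnode ranges.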



[jira] [Commented] (CASSANDRA-6091) Better Vnode support in hadoop/pig

2013-10-09 Thread Robert Coli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790734#comment-13790734
 ] 

Robert Coli commented on CASSANDRA-6091:


CASSANDRA-6169 is semi-related: it limits the number of threads in the thread pool 
to be less than the number of splits.
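
For readers unfamiliar with the pattern, capping a worker pool below the number of work items can be sketched as follows. This is not the CASSANDRA-6169 patch; {{describe_splits}}, {{describe}} and {{max_threads}} are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

R = TypeVar("R")
S = TypeVar("S")


def describe_splits(
    ranges: Sequence[R],
    describe: Callable[[R], S],
    max_threads: int,
) -> List[S]:
    """Run describe() over every range with at most max_threads workers."""
    if not ranges:
        return []
    # Never start more threads than the cap, even with thousands of ranges.
    workers = min(len(ranges), max_threads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in the same order as the input ranges.
        return list(pool.map(describe, ranges))
```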
