I know Thrift and Python and Unicode don't mix. 
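
For what it's worth, here is a minimal sketch of one way the same key can end up with two different tokens, as described in the quoted message below. It is only an illustration, not a diagnosis of the actual bug (the readme in the github project is the place for that). It assumes a hash-based partitioner that derives the token from an MD5 of the key bytes; md5_token is a made-up helper name, and the two encodings are just examples.

```python
import hashlib


def md5_token(key, encoding):
    """Hypothetical helper: MD5 of the encoded key, as an integer.

    Assumption: this only approximates how a hash-based partitioner
    turns a row key into a token; it is not Cassandra's own code.
    """
    return int(hashlib.md5(key.encode(encoding)).hexdigest(), 16)


ascii_key = "henrik"
unicode_key = "Schr\u00f6der"  # contains a non-ASCII character

# ASCII-only keys produce identical bytes in both encodings,
# so the resulting token is the same either way.
same_for_ascii = md5_token(ascii_key, "utf-8") == md5_token(ascii_key, "latin-1")

# The non-ASCII character is two bytes in UTF-8 and one byte in Latin-1,
# so the hashed bytes differ and so does the token.
same_for_unicode = md5_token(unicode_key, "utf-8") == md5_token(unicode_key, "latin-1")

print(same_for_ascii, same_for_unicode)  # prints: True False
```

If the writer and the reader disagree on how the key becomes bytes, a lookup by token would miss the row even though the data is still there.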

 

On May 7, 2011, at 4:21 PM, aaron morton <aa...@thelastpickle.com> wrote:

> I've been able to reproduce the fault using Python on my MacBook; see 
> https://github.com/amorton/cassandra-unicode-bug
> 
> When we try to find the unicode key in the index in 0.7 it fails because the 
> tokens are different. The readme in the github project has more info. 
> 
> Any thoughts?  Will try to find some more time to keep digging. 
> 
> Aaron
> 
> 
> On 7 May 2011, at 20:44, aaron morton wrote:
> 
>> get_range_slices() does read repair if enabled (I checked 
>> DoConsistencyChecksBoolean in the config; it's on by default), so you should 
>> be getting good reads. If you want belt and braces, run nodetool repair 
>> first. 
>> 
>> Hope that helps.
>> 
>> 
>> On 7 May 2011, at 11:46, Jeremy Hanna wrote:
>> 
>>> Great!  I just wanted to make sure you were getting the information you 
>>> needed.
>>> 
>>> On May 6, 2011, at 6:42 PM, Henrik Schröder wrote:
>>> 
>>>> Well, I already completed the migration program. Using get_range_slices I 
>>>> could migrate a few thousand rows per second, which means that migrating 
>>>> all of our data would take a few minutes, and we'll end up with pristine 
>>>> datafiles for the new cluster. Problem solved!
>>>> 
>>>> I'll see if I can create datafiles in 0.6 that are uncleanable in 0.7 so 
>>>> that you all can repeat this and hopefully fix it.
>>>> 
>>>> 
>>>> /Henrik Schröder
>>>> 
>>>> On Sat, May 7, 2011 at 00:35, Jeremy Hanna <jeremy.hanna1...@gmail.com> 
>>>> wrote:
>>>> If you're able, go into the #cassandra channel on freenode (IRC) and talk 
>>>> to driftx or jbellis or aaron_morton about your problem.  It could be that 
>>>> you don't have to do all of this based on a conversation there.
>>>> 
>>>> On May 6, 2011, at 5:04 AM, Henrik Schröder wrote:
>>>> 
>>>>> I'll see if I can make some example broken files this weekend.
>>>>> 
>>>>> 
>>>>> /Henrik Schröder
>>>>> 
>>>>> On Fri, May 6, 2011 at 02:10, aaron morton <aa...@thelastpickle.com> 
>>>>> wrote:
>>>>> The difficulty is the different thrift clients between 0.6 and 0.7.
>>>>> 
>>>>> If you want to roll your own solution I would consider:
>>>>> - write an app to talk to 0.6 and pull out the data using keys from the 
>>>>> other system (so you can check referential integrity while you are 
>>>>> at it). Dump the data to a flat file.
>>>>> - write an app to talk to 0.7 to load the data back in.
>>>>> 
>>>>> I've not given up digging on your migration problem; having to manually 
>>>>> dump and reload if you've done nothing wrong is not the best solution. 
>>>>> I'll try to find some time this weekend to test with:
>>>>> 
>>>>> - 0.6 server, random partitioner, standard CFs, byte column
>>>>> - load with python or the cli on osx or ubuntu (don't have a Windows 
>>>>> machine any more)
>>>>> - migrate and see what's going on.
>>>>> 
>>>>> If you can spare some sample data to load, please send it to the user 
>>>>> group or to my email address.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> -----------------
>>>>> Aaron Morton
>>>>> Freelance Cassandra Developer
>>>>> @aaronmorton
>>>>> http://www.thelastpickle.com
>>>>> 
>>>>> On 6 May 2011, at 05:52, Henrik Schröder wrote:
>>>>> 
>>>>>> We can't do a straight upgrade from 0.6.13 to 0.7.5 because we have rows 
>>>>>> stored that have unicode keys, and Cassandra 0.7.5 thinks those rows in 
>>>>>> the sstables are corrupt, and it seems impossible to clean it up without 
>>>>>> losing data.
>>>>>> 
>>>>>> However, we can still read all rows perfectly via thrift so we are now 
>>>>>> looking at building a simple tool that will copy all rows from our 0.6.13 
>>>>>> cluster to a parallel 0.7.5 cluster. Our question is now how to do that 
>>>>>> and ensure that we actually get all rows migrated? It's a pretty small 
>>>>>> cluster, 3 machines, a single keyspace, a single columnfamily, ~2 
>>>>>> million rows, a few GB of data, and a replication factor of 3.
>>>>>> 
>>>>>> So what's the best way? Call get_range_slices and move through the 
>>>>>> entire token space? We also have all row keys in a secondary system, 
>>>>>> would it be better to use that and make calls to get_multi or 
>>>>>> get_multi_slices instead? Are we correct in assuming that if we use the 
>>>>>> consistencylevel ALL we'll get all rows?
>>>>>> 
>>>>>> 
>>>>>> /Henrik Schröder
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
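
On the original question about walking the whole token space with get_range_slices: a rough sketch of the paging loop, in case it helps anyone repeating this. It does not talk to Cassandra. fetch_page is a hypothetical stand-in for a real get_range_slices call, backed here by an in-memory dict ordered by an MD5 of the key. It assumes the start key is inclusive, so every page after the first repeats the last row of the previous page and that row has to be dropped.

```python
import hashlib


def make_fake_fetch(rows):
    """Build a hypothetical stand-in for a get_range_slices call.

    Returns fetch_page(start_key, count), which yields up to `count`
    (key, value) pairs in hash order, starting at start_key inclusive.
    An empty start_key means "from the beginning".
    """
    ordered = sorted(rows.items(), key=lambda kv: hashlib.md5(kv[0]).digest())

    def fetch_page(start_key, count):
        if start_key == b"":
            begin = 0
        else:
            begin = next(i for i, (k, _) in enumerate(ordered) if k == start_key)
        return ordered[begin:begin + count]

    return fetch_page


def iter_all_rows(fetch_page, page_size=100):
    """Yield every (key, value) pair exactly once by paging."""
    if page_size < 2:
        # With an inclusive start key, a page of one row never advances.
        raise ValueError("page_size must be at least 2")
    start_key = b""
    first_page = True
    while True:
        page = fetch_page(start_key, page_size)
        if not first_page:
            page = page[1:]  # drop the row already seen on the previous page
        if not page:
            return
        for key, value in page:
            yield key, value
        start_key = page[-1][0]  # resume from the last key we handed out
        first_page = False


# Small usage example with made-up data, including one non-ASCII key.
source = {("key%d" % i).encode("utf-8"): i for i in range(25)}
source["Schr\u00f6der".encode("utf-8")] = 99
migrated = list(iter_all_rows(make_fake_fetch(source), page_size=10))
```

Counting the rows that come out and comparing against the keys in the secondary system would be a cheap check that nothing was skipped.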
