[ https://issues.apache.org/jira/browse/PHOENIX-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736423#comment-14736423 ]

James Heather commented on PHOENIX-2240:
----------------------------------------

Note that for 100M rows the script does indeed generate approximately the right 
number of unique rows, but the shortfall will get worse and worse as the number 
of rows requested increases. I might try it with a billion rows to see how far 
short it falls.

In fact, in the limit, the number of unique rows generated is capped at a 
constant, because the number of possible primary keys is finite...
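
To make that concrete, here is a rough back-of-the-envelope sketch. The 
key-space size below is purely an assumed, illustrative figure (chosen because, 
if keys were drawn uniformly from about 10^13 possibilities, you'd expect 
roughly the ~500 duplicates reported at 100M rows); it is not measured from the 
script.

    import math

    def expected_unique(n_rows, key_space):
        """Expected number of distinct keys when n_rows keys are drawn
        uniformly at random from a key space of key_space possibilities."""
        return key_space * (1.0 - math.exp(-float(n_rows) / key_space))

    KEY_SPACE = 1e13  # assumed, illustrative key-space size

    for n in (1e8, 1e9, 1e10, 1e13, 1e15):
        uniq = expected_unique(n, KEY_SPACE)
        print("requested %.0e  expected unique %.4e  expected duplicates %.3e"
              % (n, uniq, n - uniq))

The expected duplicate count grows roughly quadratically at first (about 
n^2 / (2 * key space)), and the expected unique count flattens out at the 
key-space size, which is the "constant in the limit" above.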

I'm not sure that fixing this completely will be easy. You can't just bung the 
generated rows into a hashset and check that you don't create a duplicate, 
because for a very large number of rows you won't be able to hold them all in 
RAM.

A couple of things you could do:

(1) Increase the entropy. Is the full timestamp precision and range being used? 
(Are you using all the bits?) Could another random value be added into the 
primary key?
(2) Generate the timestamps according to some fixed scheme rather than 
randomly. For instance, divide the entire range by the number of rows, and then 
generate a timestamp in the appropriate segment for each row (see the sketch 
after this list). You'll then end up with monotonically increasing timestamps, 
which might or might not be acceptable.
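
Here is a minimal sketch of what option (2) could look like, independent of how 
the script currently generates its data; the date range and row count are just 
placeholder values.

    import datetime

    def segmented_timestamps(n_rows, start, end):
        """Yield n_rows timestamps spaced evenly across [start, end).
        They are unique as long as (end - start) / n_rows is at least
        one microsecond (the resolution of datetime/timedelta)."""
        step = (end - start) / n_rows
        for i in range(n_rows):
            yield start + i * step

    start = datetime.datetime(2015, 1, 1)
    end = datetime.datetime(2016, 1, 1)
    for ts in segmented_timestamps(5, start, end):
        print(ts)

If strictly increasing timestamps turn out to be a problem, the segment-to-row 
assignment could be shuffled: the values stay unique, only the ordering changes.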

I suspect that the full timestamp precision isn't being used. At least, the 
number of unique rows (just considering the PK parts) in the CSV does seem to 
correspond to the number of rows upserted, so certainly the CSV timestamp 
output isn't *more* precise than Phoenix can store. Probably it's *less* 
precise.
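
For what it's worth, here is the sort of check I mean, as a sketch; the PK 
column positions and the CSV filename are assumed placeholders, not taken from 
the actual schema. Holding the PK tuples in a set is fine for a one-off check 
of a 100M-row CSV on a machine with enough memory, but (as noted above) it 
isn't a fix for arbitrarily large runs.

    import csv

    PK_COLUMNS = (0, 1, 2)  # assumed: positions of the PK columns in the CSV

    def distinct_pk_count(csv_path):
        """Count distinct primary-key tuples in the generated CSV,
        keeping only the PK tuples (not whole rows) in memory."""
        seen = set()
        with open(csv_path) as f:
            for row in csv.reader(f):
                seen.add(tuple(row[i] for i in PK_COLUMNS))
        return len(seen)

    print(distinct_pk_count("data.csv"))  # "data.csv" is a placeholder path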

> Duplicate keys generated by performance.py script
> -------------------------------------------------
>
>                 Key: PHOENIX-2240
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2240
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Mujtaba Chohan
>            Assignee: Mujtaba Chohan
>            Priority: Minor
>
> 500 out of 100M rows are duplicate. See details at 
> http://search-hadoop.com/m/9UY0h26jwA21rW0i1/v=threaded



