[ https://issues.apache.org/jira/browse/PHOENIX-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736423#comment-14736423 ]
James Heather commented on PHOENIX-2240:
----------------------------------------
Note that for 100M rows the script does indeed generate approximately the right
number of unique rows; but the shortfall will grow as the number of rows
requested increases. I might try it on a billion rows to see how far short it
falls.
In fact, in the limit, the number of unique rows generated will level off at a
constant, because the number of possible primary keys is finite...
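To put rough numbers on this, here is a minimal sketch of the standard estimate for random draws with replacement. It assumes keys are drawn uniformly at random from a key space of size key_space, which may well not match what performance.py actually does; the function name and the key-space size of 10^13 are my own illustration, not taken from the script.

```python
import math

def expected_unique(n_rows, key_space):
    """Expected number of distinct keys after n_rows uniform random draws
    from key_space possible keys (assumption: uniform, independent draws)."""
    # N * (1 - (1 - 1/N)^n), written with log1p/expm1 to stay accurate
    # when 1/N is tiny.
    return -key_space * math.expm1(n_rows * math.log1p(-1.0 / key_space))

# Hypothetical key space of 1e13, chosen only because it is consistent
# with the reported 500 duplicates in 100M rows under the uniform assumption.
KEY_SPACE = 1e13
for n in (1e8, 1e9, 1e15):
    unique = expected_unique(n, KEY_SPACE)
    print(f"requested={n:.0e} unique={unique:.6e} duplicates={n - unique:.1f}")
```

Under that assumption, 100M rows lose roughly 500 to duplicates and a billion rows lose roughly 50,000, while requesting far more rows than there are keys just saturates at the key-space size.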
I'm not sure that fixing this completely will be easy. You can't just bung the
generated rows into a hashset and check that you don't create a duplicate,
because for a very large number of rows you won't be able to hold them all in
RAM.
A couple of things you could do:
(1) Increase the entropy. Is the full timestamp precision and range being used?
(Are you using all the bits?) Could another random value be added into the
primary key?
(2) Generate the timestamps according to some fixed scheme rather than
randomly. For instance, divide the entire range by the number of rows, and then
generate timestamps in the appropriate segment for each row. You'll then end up
with monotonically increasing timestamps, which might or might not be
acceptable.
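Option (2) could look something like the sketch below. This is not how performance.py currently works; the function name, the microsecond resolution, and the optional rng argument are assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta

def stratified_timestamps(start, end, n_rows, rng=random):
    """Yield n_rows timestamps, one from each of n_rows equal, disjoint
    segments of [start, end). Illustrative only."""
    # Total range expressed as a whole number of microseconds.
    span_us = (end - start) // timedelta(microseconds=1)
    if n_rows > span_us:
        raise ValueError("more rows requested than distinct timestamps available")
    for i in range(n_rows):
        lo = i * span_us // n_rows          # inclusive start of segment i
        hi = (i + 1) * span_us // n_rows    # exclusive end of segment i
        yield start + timedelta(microseconds=rng.randrange(lo, hi))

# Example: 5 timestamps spread over one day.
for ts in stratified_timestamps(datetime(2015, 1, 1), datetime(2015, 1, 2), 5):
    print(ts.isoformat())
```

Because the segments do not overlap, the timestamps are unique and strictly increasing at microsecond resolution. They stay unique only if they are written out at that resolution; formatting them at a coarser precision would reintroduce collisions.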
I suspect that the full timestamp precision isn't being used. At least, the
number of unique rows (just considering the PK parts) in the CSV does seem to
correspond to the number of rows upserted, so certainly the CSV timestamp
output isn't *more* precise than Phoenix can store. Probably it's *less*
precise.
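For reference, the kind of check described above can be sketched as follows. The column positions of the primary key are an assumption (passed in as pk_columns); I am not asserting the actual CSV layout here. It keeps every key in a set, so it only suits files small enough for that to fit in memory.

```python
import csv

def count_unique_keys(lines, pk_columns):
    """Return (total_rows, unique_keys) for CSV input, comparing only the
    columns listed in pk_columns (hypothetical positions of the PK parts)."""
    total = 0
    seen = set()
    for row in csv.reader(lines):
        total += 1
        # Compare key parts exactly as written in the CSV, so any loss of
        # timestamp precision in the output shows up as duplicates here.
        seen.add(tuple(row[i] for i in pk_columns))
    return total, len(seen)

sample = [
    "a,x,2015-01-01 00:00:00,1",
    "a,x,2015-01-01 00:00:00,2",
    "b,x,2015-01-01 00:00:00,3",
]
print(count_unique_keys(sample, pk_columns=(0, 1, 2)))
```

If the unique count from the CSV matches the row count in the table after upserting, the duplicates already exist in the CSV text, which points at the generator's output precision rather than at Phoenix.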
> Duplicate keys generated by performance.py script
> -------------------------------------------------
>
> Key: PHOENIX-2240
> URL: https://issues.apache.org/jira/browse/PHOENIX-2240
> Project: Phoenix
> Issue Type: Bug
> Reporter: Mujtaba Chohan
> Assignee: Mujtaba Chohan
> Priority: Minor
>
> 500 out of 100M rows are duplicate. See details at
> http://search-hadoop.com/m/9UY0h26jwA21rW0i1/v=threaded