[
https://issues.apache.org/jira/browse/PHOENIX-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun Thangamani updated PHOENIX-2607:
-------------------------------------
Attachment: PHOENIX-2607.patch
> PhoenixMapReduceUtil Upserts with earlier ts (relative to latest data ts)
> slower by 25x after stats collection
> --------------------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-2607
> URL: https://issues.apache.org/jira/browse/PHOENIX-2607
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.4.0
> Reporter: Arun Thangamani
> Labels: patch
> Fix For: 4.4.0
>
> Attachments: PHOENIX-2607.patch, hbase-master-fast-upload.log,
> hbase-master-slow-upload.log, hbase-rs01-fast-upload.log,
> hbase-rs01-slow-upload.log, hbase-rs02-fast-upload.log,
> hbase-rs02-slow-upload.log, hbase-rs03-fast-upload.log,
> hbase-rs03-slow-upload.log, hbase-rs04-fast-upload.log,
> hbase-rs04-slow-upload.log, phoenix_slow_map_process_jstack.txt,
> region_server_2_jstack.txt
>
>
> Description of the problem:
> 1) We face a 25x slow down when go back in time to load data in a table (when
> specific timestamps set on connections during upserts)
> 2) set phoenix.stats.useCurrentTime=false (and
> phoenix.stats.guidepost.per.region 1) which at least makes the forward
> timestamps upserts perform correctly
> 3) From what I can tell from the phoenix source code, logs attached and
> jstacks from the region servers -- we continuously try to lookup the uncached
> definition of the table when we have client timestamp earlier than the last
> modified timestamp of the table in stats
> 4) To reproduce, create a table with timestamp=100, and load 10M rows with
> PhoenixMapReduceUtil and timestamps=1447574400000,1448092800000, wait for 20
> mins (15+ min, phoenix.stats.updateFrequency is 15mins)
> After 20 mins, load 10M rows with a earlier timestamp compared to the latest
> data (timestamp=1447660800000) and observe the 25x slowness, after this once
> again load a forward timestamp 1448179200000 and observe the quickness
> 5) I was not able to reproduce this issue with simple multi threaded upserts
> from a jdbc connection, with simple multi threaded upserts the stats table
> never gets populated unlike PhoenixMapReduceUtil
> We are trying to use phoenix as a cache store to do analytics with the last
> 60 days of data, a total of about 1.5 billion rows
> The table has a composite key and the data arrives in different times from
> different sources, so it is easier to maintain the timestamps of the data and
> expire the data automatically, this performance makes a difference between
> inserting the data in 10 mins versus 2 hours, 2 hours for data inserts
> blocking up the cluster that we have.
> We are even talking about our use cases in the upcoming strata conference in
> March.. (Thanks to excellent community)
> Steps to reproduce:
> Source code is available in
> (https://github.com/athangamani/phoenix_mapreduce_timestamp_upsert) and the
> jar the source code produces is attached which is readily runnable
> 1) We use the following params to keep the stats collection happy to isolate
> the specific issue
> phoenix.stats.useCurrentTime false
> phoenix.stats.guidepost.per.region 1
> 2) Create a table in phoenix
> Run the following main class from the project..
> (StatPhoenixTableCreationTest).. It will create a table with timestamp=100
> CREATE TABLE stat_table (
> pk1 VARCHAR NOT NULL,
> pk2 VARCHAR NOT NULL,
> pk3 UNSIGNED_LONG NOT NULL,
> stat1 UNSIGNED_LONG,
> stat2 UNSIGNED_LONG,
> stat3 UNSIGNED_LONG,
> CONSTRAINT pk PRIMARY KEY (pk1, pk2, pk3)
> ) SALT_BUCKETS=32, COMPRESSION='LZ4'
> 3) Open the code base to look at the sample for PhoenixMapReduceUtil.. With
> DBWritable..
> 4) Within the codebase, we get phoenix connection for the mappers using the
> following settings in order to have a fixed client timestamp
> conf.set(PhoenixRuntime.CURRENT_SCN_ATTRIB, ""+(timestamp));
> 5) fix the hbase-site.xml in the codebase for zookeeper quorum and hbase
> parent znode info
> 6) simply run the StatDataCreatorTest to create data for the run and load it
> in hdfs for 10M records
> 7) to run the ready made jar attached, use the following commands,
> hadoop jar phoenix_mr_ts_upsert-jar-with-dependencies.jar
> statPhoenixLoader hdfs:///user/*****/stat-data-1.txt STAT_TABLE 1447574400000
> hadoop jar phoenix_mr_ts_upsert-jar-with-dependencies.jar
> statPhoenixLoader hdfs:///user/*****/stat-data-1.txt STAT_TABLE 1448092800000
> After 20 mins…
> hadoop jar phoenix_mr_ts_upsert-jar-with-dependencies.jar
> statPhoenixLoader hdfs:///user/*****/stat-data-1.txt STAT_TABLE 1447660800000
> hadoop jar phoenix_mr_ts_upsert-jar-with-dependencies.jar
> statPhoenixLoader hdfs:///user/*****/stat-data-1.txt STAT_TABLE 1449000000000
> 8) observe the 25x slowness in the 3rd run and observe the normal pace in the
> 4th run
> 9) Attached are the region server logs for the fast process in forward
> progression versus the slow progress in reverse timestamp
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)