Thanks James; here is the JIRA with the code snippet for direct encoding: https://issues.apache.org/jira/browse/PHOENIX-1737
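For anyone reading this in the archive without opening the JIRA: the idea behind the direct path is to encode the primary key and column values with Phoenix's type system and build the KeyValues directly, skipping the JDBC upsert machinery. A rough sketch of that idea follows. It is illustrative only and is not the code attached to the JIRA: the qualifier names are placeholders, it assumes the default column family and the class-per-type PDataType API (with the older enum-style PDataType the calls look slightly different), and the emitted KeyValues still need to end up sorted in row key order before the HFiles are written (per James's question below).

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.phoenix.query.QueryConstants;
import org.apache.phoenix.schema.types.PInteger;
import org.apache.phoenix.schema.types.PSmallint;
import org.apache.phoenix.schema.types.PVarchar;
import org.apache.phoenix.util.ByteUtil;

public class DirectEncoder {

    private static final byte[] FAMILY = QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES;
    // Placeholder qualifier names -- substitute the target table's column names.
    private static final byte[] Q1 = Bytes.toBytes("V1");
    private static final byte[] Q2 = Bytes.toBytes("V2");
    private static final byte[] Q3 = Bytes.toBytes("V3");

    // Encode one record (3-integer composite PK, 2 SMALLINT qualifiers,
    // 1 VARCHAR qualifier) straight into KeyValues, bypassing JDBC.
    public static List<KeyValue> encode(int pk1, int pk2, int pk3,
                                        short v1, short v2, String v3, long ts) {
        // Composite row key: three fixed-width INTEGER encodings concatenated
        // in PK order (fixed-width types need no separator bytes).
        byte[] rowKey = Bytes.add(
                PInteger.INSTANCE.toBytes(pk1),
                PInteger.INSTANCE.toBytes(pk2),
                PInteger.INSTANCE.toBytes(pk3));

        List<KeyValue> kvs = new ArrayList<KeyValue>();
        // Phoenix expects an empty-column KeyValue marking that the row exists.
        kvs.add(new KeyValue(rowKey, FAMILY, QueryConstants.EMPTY_COLUMN_BYTES,
                ts, ByteUtil.EMPTY_BYTE_ARRAY));
        kvs.add(new KeyValue(rowKey, FAMILY, Q1, ts, PSmallint.INSTANCE.toBytes(v1)));
        kvs.add(new KeyValue(rowKey, FAMILY, Q2, ts, PSmallint.INSTANCE.toBytes(v2)));
        kvs.add(new KeyValue(rowKey, FAMILY, Q3, ts, PVarchar.INSTANCE.toBytes(v3)));
        return kvs;
    }
}

A sketch of the JDBC-based path this is being compared against is appended after the quoted thread below.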
On Sat, Mar 14, 2015 at 10:32 AM, James Taylor <[email protected]> wrote:
> Please file a JIRA, Tulasi. This is a fair point. I'm surprised it's
> 4x faster. Can you share your code for the direct encoding path there
> too? Are you still doing the CSV parsing in your code? Also, are you
> sorting the KeyValues, or do you know that they'll be in row key order
> in the CSV file?
>
> In the meantime, I'll clean up the original patch. I have one more
> improvement I can make that's pretty straightforward too.
>
> Thanks,
> James
>
> On Fri, Mar 13, 2015 at 2:54 PM, Tulasi Paradarami
> <[email protected]> wrote:
> >>
> >> I don't know of any benchmarks vs. the HBase bulk loader. Would be
> >> interesting if you could come up with an apples-to-apples test.
> >
> > I did some testing to get an apples-to-apples comparison between the
> > two options.
> >
> > For 10 million rows (the primary key is a 3-column composite key, with
> > 3 column qualifiers):
> > JDBC bulk-loading: 430 sec (after applying the PHOENIX-1711 patch)
> > Direct Phoenix encoding: 112 sec
> >
> > The direct encoding path executes in 1/4th the JDBC time, and I think
> > the difference is significant enough to provide APIs for direct Phoenix
> > encoding in the bulk-loader.
> >
> > Thanks
> >
> > On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <[email protected]> wrote:
> >
> >> I don't know of any benchmarks vs. the HBase bulk loader. Would be
> >> interesting if you could come up with an apples-to-apples test.
> >>
> >> The 100TB binary file cannot be partitioned at all? You're always bound
> >> to a single process. Bummer. I guess plan B could be pre-processing the
> >> binary file into something splittable. You'll cover the data twice, but
> >> if Phoenix encoding really is the current bottleneck, as your mail
> >> indicates, then separating the decoding of the binary file from encoding
> >> of the Phoenix output should allow for parallelizing the second step and
> >> improve the state of things.
> >>
> >> In the meantime, it would be good to look at perf improvements of the
> >> Phoenix encoding step. Any volunteers lurking about?
> >>
> >> -n
> >>
> >> On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami
> >> <[email protected]> wrote:
> >>
> >> > Gabriel, Nick, thanks for your inputs. My comments below.
> >> >
> >> > > Although it may look as though data is being written over the wire
> >> > > to Phoenix, the execution of an upsert executor and retrieval of the
> >> > > uncommitted KeyValues is all local (in memory). The code is
> >> > > implemented in this way because JDBC is the general API used within
> >> > > Phoenix -- there isn't a direct "convert fields to Phoenix encoding"
> >> > > API, although this is doing the equivalent operation.
> >> >
> >> > I understand the data processing is in memory, but performance can be
> >> > improved if there is a direct conversion to Phoenix encoding.
> >> > Are there any performance comparison results between the Phoenix and
> >> > HBase bulk-loaders?
> >> >
> >> > > Could you give some more information on your performance numbers?
> >> > > For example, is this the throughput that you're getting in a single
> >> > > process, or over a number of processes? If so, how many processes?
> >> >
> >> > It's currently running as a single mapper processing a binary file
> >> > (un-splittable). Disk throughput doesn't look to be an issue here.
> >> > Production has machines of the same processing capability but
> >> > obviously more nodes and input files.
> >> >
> >> > > Also, how many columns are in the records that you're loading?
> >> >
> >> > The row size is small: 3 integers for the PK, 2 short qualifiers, and
> >> > 1 varchar qualifier.
> >> >
> >> > > What is the current (projected) time required to load the data?
> >> >
> >> > About 20-25 days.
> >> >
> >> > > What is the minimum allowable ingest speed to be considered
> >> > > satisfactory?
> >> >
> >> > We would like to finish the load in less than 10-12 days.
> >> >
> >> > > You can make things go faster by increasing the number of mappers.
> >> >
> >> > The input file (binary) is not splittable; a mapper is tied to the
> >> > specific file.
> >> >
> >> > > What changes did you make to the map() method? Increased logging,
> >> > > performance enhancements, plugging in custom logic, something else?
> >> >
> >> > I added custom logic to the map() method.
> >> >
> >> > On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <[email protected]> wrote:
> >> >
> >> > > Also: how large is your cluster? You can make things go faster by
> >> > > increasing the number of mappers. What changes did you make to the
> >> > > map() method? Increased logging, performance enhancements, plugging
> >> > > in custom logic, something else?
> >> > >
> >> > > On Thursday, March 5, 2015, Gabriel Reid <[email protected]> wrote:
> >> > >
> >> > > > Hi Tulasi,
> >> > > >
> >> > > > Answers (and questions) inlined below:
> >> > > >
> >> > > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami
> >> > > > <[email protected]> wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > Here are the details of our environment:
> >> > > > > Phoenix 4.3
> >> > > > > HBase 0.98.6
> >> > > > >
> >> > > > > I'm loading data into a Phoenix table using the CSV bulk-loader
> >> > > > > (after making some changes to the map(...) method) and it is
> >> > > > > processing about 16,000 - 20,000 rows/sec. I noticed that the
> >> > > > > bulk-loader spends up to 40% of the execution time in the
> >> > > > > following steps.
> >> > > > >
> >> > > > > //...
> >> > > > > csvRecord = csvLineParser.parse(value.toString());
> >> > > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
> >> > > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
> >> > > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
> >> > > > > //...
> >> > > > >
> >> > > > The non-code translation of those steps is:
> >> > > > 1. Parse the CSV record
> >> > > > 2. Convert the contents of the CSV record into KeyValues
> >> > > >
> >> > > > Although it may look as though data is being written over the wire
> >> > > > to Phoenix, the execution of an upsert executor and retrieval of
> >> > > > the uncommitted KeyValues is all local (in memory). The code is
> >> > > > implemented in this way because JDBC is the general API used
> >> > > > within Phoenix -- there isn't a direct "convert fields to Phoenix
> >> > > > encoding" API, although this is doing the equivalent operation.
> >> > > >
> >> > > > Could you give some more information on your performance numbers?
> >> > > > For example, is this the throughput that you're getting in a
> >> > > > single process, or over a number of processes? If so, how many
> >> > > > processes? Also, how many columns are in the records that you're
> >> > > > loading?
> >> > > >
> >> > > > > We plan to load up to 100TB of data, and the overall performance
> >> > > > > of the bulk-loader is not satisfactory.
> >> > > >
> >> > > > How many records are in that 100TB? What is the current (projected)
> >> > > > time required to load the data? What is the minimum allowable ingest
> >> > > > speed to be considered satisfactory?
> >> > > >
> >> > > > - Gabriel
> >> > > >
> >> > >
> >> >
> >>
> >
>
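For readers comparing the two approaches, below is a fleshed-out sketch of how the JDBC-based encode step quoted above sits inside a bulk-loader mapper. It is illustrative only: the table and column names are placeholders, the line parsing is simplified (the stock loader uses commons-csv and its CsvUpsertExecutor helper rather than a plain PreparedStatement), and error handling and connection configuration are trimmed. The mechanism, though, is the one Gabriel describes -- the UPSERT executes locally, the uncommitted KeyValues are drained out of the connection and emitted for HFile writing, and the connection is then rolled back.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.phoenix.util.PhoenixRuntime;

public class JdbcEncodeMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private Connection conn;
    private PreparedStatement upsertStmt;
    private final ImmutableBytesWritable outputKey = new ImmutableBytesWritable();

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Local Phoenix connection; auto-commit is left off so the upsert's
            // KeyValues stay buffered in the connection instead of being sent.
            // "zk-host" and the table/column names below are placeholders.
            conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
            upsertStmt = conn.prepareStatement(
                    "UPSERT INTO MY_TABLE (PK1, PK2, PK3, V1, V2, V3) VALUES (?, ?, ?, ?, ?, ?)");
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // 1. Parse the input line (naive split; the stock loader uses commons-csv).
            String[] fields = value.toString().split(",");

            // 2. Execute the UPSERT locally; nothing goes over the wire --
            //    Phoenix encodes the row into KeyValues buffered in the connection.
            upsertStmt.setInt(1, Integer.parseInt(fields[0]));
            upsertStmt.setInt(2, Integer.parseInt(fields[1]));
            upsertStmt.setInt(3, Integer.parseInt(fields[2]));
            upsertStmt.setShort(4, Short.parseShort(fields[3]));
            upsertStmt.setShort(5, Short.parseShort(fields[4]));
            upsertStmt.setString(6, fields[5]);
            upsertStmt.executeUpdate();

            // 3. Drain the uncommitted KeyValues and emit them for the
            //    HFile-writing reducer, as in the snippet quoted in the thread.
            Iterator<Pair<byte[], List<KeyValue>>> it =
                    PhoenixRuntime.getUncommittedDataIterator(conn, true);
            while (it.hasNext()) {
                for (KeyValue kv : it.next().getSecond()) {
                    outputKey.set(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength());
                    context.write(outputKey, kv);
                }
            }

            // 4. Roll back so the connection's mutation buffer is cleared.
            conn.rollback();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}

The parse/encode cost measured in the thread lives entirely in steps 1 and 2 here, which is why a direct encoding path that skips the JDBC machinery can come out roughly 4x faster.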
