Thanks, James; here is the JIRA with the code snippet for direct encoding:
https://issues.apache.org/jira/browse/PHOENIX-1737
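
In case it is useful for the list without opening the JIRA: below is a rough
sketch of what I mean by "direct encoding". It is illustrative only, not the
exact code attached to PHOENIX-1737 -- it skips the JDBC upsert /
getUncommittedDataIterator round trip and builds Phoenix-encoded KeyValues
straight from the parsed fields. The qualifier names, the "0" column family,
and the omission of salt bytes and the empty-column KeyValue are assumptions
made just for this example.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.phoenix.schema.types.PInteger;
import org.apache.phoenix.schema.types.PSmallint;
import org.apache.phoenix.schema.types.PVarchar;

public class DirectEncoder {
    // Illustrative names: default "0" column family, made-up qualifiers.
    private static final byte[] CF = Bytes.toBytes("0");
    private static final byte[] Q1 = Bytes.toBytes("C1");
    private static final byte[] Q2 = Bytes.toBytes("C2");
    private static final byte[] Q3 = Bytes.toBytes("C3");

    // Build the KeyValues for one record without going through the JDBC
    // upsert path. Assumes a non-salted table with a 3-integer primary key.
    public static List<KeyValue> encode(int pk1, int pk2, int pk3,
                                        short c1, short c2, String c3, long ts) {
        // Fixed-width PK columns are concatenated to form the row key.
        byte[] rowKey = Bytes.add(PInteger.INSTANCE.toBytes(pk1),
                                  PInteger.INSTANCE.toBytes(pk2),
                                  PInteger.INSTANCE.toBytes(pk3));
        List<KeyValue> kvs = new ArrayList<KeyValue>();
        kvs.add(new KeyValue(rowKey, CF, Q1, ts, PSmallint.INSTANCE.toBytes(c1)));
        kvs.add(new KeyValue(rowKey, CF, Q2, ts, PSmallint.INSTANCE.toBytes(c2)));
        kvs.add(new KeyValue(rowKey, CF, Q3, ts, PVarchar.INSTANCE.toBytes(c3)));
        // The bulk-load job still has to deliver these to the HFile writer in
        // sorted order (e.g. via HBase's KeyValueSortReducer).
        return kvs;
    }
}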


On Sat, Mar 14, 2015 at 10:32 AM, James Taylor <[email protected]>
wrote:

> Please file a JIRA, Tulasi. This is a fair point. I'm surprised it's
> 4x faster. Can you share your code for the direct encoding path there
> too? Are you still doing the CSV parsing in your code? Also, are you
> sorting the KeyValues or do you know that they'll be in row key order
> in the CSV file?
>
> In the meantime, I'll clean up the original patch. I have one more
> improvement I can make that's pretty straightforward too.
>
> Thanks,
> James
>
> On Fri, Mar 13, 2015 at 2:54 PM, Tulasi Paradarami
> <[email protected]> wrote:
> >>
> >> I don't know of any benchmarks vs. HBase bulk loader. Would be
> >> interesting, if you could come up with an apples-to-apples test.
> >
> >
> > I did some testing to get an apples-to-apples comparison between the two
> > options.
> >
> > For 10 million rows (primary key is a 3 column composite key with 3
> > column qualifiers):
> > JDBC bulk-loading: 430 sec (after applying PHOENIX-1711 patch)
> > Direct Phoenix encoding: 112 sec
> >
> > Using the direct encoding path executes in about 1/4th of the JDBC time,
> > and I think the difference is significant enough to provide APIs for direct
> > Phoenix encoding in the bulk-loader.
> >
> > Thanks
> >
> >
> > On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <[email protected]> wrote:
> >
> >> I don't know of any benchmarks vs. HBase bulk loader. Would be
> >> interesting, if you could come up with an apples-to-apples test.
> >>
> >> 100TB binary file cannot be partitioned at all? You're always bound to a
> >> single process. Bummer. I guess plan B could be pre-processing the binary
> >> file into something splittable. You'll cover the data twice, but if Phoenix
> >> encoding really is the current bottleneck, as your mail indicates, then
> >> separating the decoding of the binary file from encoding of the Phoenix
> >> output should allow for parallelizing the second step and improve the state
> >> of things.
> >>
> >> Meantime, it would be good to look at perf improvements of the Phoenix
> >> encoding step. Any volunteers lurking about?
> >>
> >> -n
> >>
> >> On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami <
> >> [email protected]> wrote:
> >>
> >> > Gabriel, Nick, thanks for your inputs. My comments below.
> >> >
> >> > > Although it may look as though data is being written over the wire to
> >> > > Phoenix, the execution of an upsert executor and retrieval of the
> >> > > uncommitted KeyValues is all local (in memory). The code is implemented
> >> > > in this way because JDBC is the general API used within Phoenix -- there
> >> > > isn't a direct "convert fields to Phoenix encoding" API, although this
> >> > > is doing the equivalent operation.
> >> >
> >> > I understand the data processing is in memory, but performance can be
> >> > improved if there is a direct conversion to Phoenix encoding.
> >> > Are there any performance comparison results between the Phoenix and
> >> > HBase bulk-loaders?
> >> >
> >> > > Could you give some more information on your performance numbers? For
> >> > > example, is this the throughput that you're getting in a single process,
> >> > > or over a number of processes? If so, how many processes?
> >> >
> >> > It's currently running as a single mapper processing a binary file
> >> > (un-splittable). Disk throughput doesn't look to be an issue here.
> >> > Production has machines of the same processing capability, but obviously
> >> > more nodes and input files.
> >> >
> >> >
> >> > > Also, how many columns are in the records that you're loading?
> >> >
> >> > The row size is small: 3 integers for the PK, 2 short qualifiers, and 1
> >> > varchar qualifier.
> >> >
> >> > > What is the current (projected) time required to load the data?
> >> >
> >> > About 20-25 days
> >> >
> >> >
> >> > > What is the minimum allowable ingest speed to be considered satisfactory?
> >> >
> >> > We would like to finish the load in less than 10-12 days.
> >> >
> >> >
> >> > > You can make things go faster by increasing the number of mappers.
> >> >
> >> > The input file (binary) is not splittable; a mapper is tied to the
> >> > specific file.
> >> >
> >> > > What changes did you make to the map() method? Increased logging,
> >> > > performance enhancements, plugging in custom logic, something else?
> >> >
> >> > I added custom logic to the map() method.
> >> >
> >> >
> >> >
> >> > On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <[email protected]>
> >> > wrote:
> >> >
> >> > > Also: how large is your cluster? You can make things go faster by
> >> > > increasing the number of mappers. What changes did you make to the
> >> > > map() method? Increased logging, performance enhancements, plugging in
> >> > > custom logic, something else?
> >> > >
> >> > > On Thursday, March 5, 2015, Gabriel Reid <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hi Tulasi,
> >> > > >
> >> > > > Answers (and questions) inlined below:
> >> > > >
> >> > > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami <
> >> > > > [email protected]> wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > Here are the details of our environment:
> >> > > > > Phoenix 4.3
> >> > > > > HBase 0.98.6
> >> > > > >
> >> > > > > I'm loading data to a Phoenix table using the CSV bulk-loader (after
> >> > > > > making some changes to the map(...) method) and it is processing
> >> > > > > about 16,000 - 20,000 rows/sec. I noticed that the bulk-loader spends
> >> > > > > up to 40% of the execution time in the following steps.
> >> > > >
> >> > > >
> >> > > > > //...
> >> > > > > csvRecord = csvLineParser.parse(value.toString());
> >> > > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
> >> > > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
> >> > > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
> >> > > > > //...
> >> > > > >
> >> > > >
> >> > > > The non-code translation of those steps is:
> >> > > > 1. Parse the CSV record
> >> > > > 2. Convert the contents of the CSV record into KeyValues
> >> > > >
> >> > > > Although it may look as though data is being written over the wire to
> >> > > > Phoenix, the execution of an upsert executor and retrieval of the
> >> > > > uncommitted KeyValues is all local (in memory). The code is implemented
> >> > > > in this way because JDBC is the general API used within Phoenix --
> >> > > > there isn't a direct "convert fields to Phoenix encoding" API, although
> >> > > > this is doing the equivalent operation.
> >> > > >
> >> > > > Could you give some more information on your performance numbers? For
> >> > > > example, is this the throughput that you're getting in a single
> >> > > > process, or over a number of processes? If so, how many processes?
> >> > > > Also, how many columns are in the records that you're loading?
> >> > > >
> >> > > >
> >> > > > >
> >> > > > > We plan to load up to 100TB of data and the overall performance of
> >> > > > > the bulk-loader is not satisfactory.
> >> > > > >
> >> > > >
> >> > > > How many records are in that 100TB? What is the current (projected)
> >> > > > time required to load the data? What is the minimum allowable ingest
> >> > > > speed to be considered satisfactory?
> >> > > >
> >> > > > - Gabriel
> >> > > >
> >> > >
> >> >
> >>
>
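
P.S. Re: Nick's earlier suggestion upthread about pre-processing the
un-splittable binary file into something splittable -- a hypothetical sketch of
that conversion is below (readRecord() is a placeholder for our format-specific
decoding; the output here is just text records that the existing CSV path could
consume, and a block-compressed SequenceFile is only one possible container):

import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class BinaryToSequenceFile {

    // Stream the binary file once and rewrite it as a splittable,
    // block-compressed SequenceFile so the Phoenix-encoding pass can run
    // with many mappers.
    public static void convert(Configuration conf, Path in, Path out) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (DataInputStream binary = fs.open(in);
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                 SequenceFile.Writer.file(out),
                 SequenceFile.Writer.keyClass(NullWritable.class),
                 SequenceFile.Writer.valueClass(Text.class),
                 SequenceFile.Writer.compression(
                     SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
            Text line = new Text();
            String record;
            while ((record = readRecord(binary)) != null) {
                line.set(record);                      // e.g. a CSV line per record
                writer.append(NullWritable.get(), line);
            }
        }
    }

    // Placeholder: decode one record from the proprietary binary format,
    // or return null at end of file.
    private static String readRecord(DataInputStream in) throws IOException {
        throw new UnsupportedOperationException("format-specific decoding goes here");
    }
}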
