Hi Bryan,
Thanks for replying. I'll try to answer some of your questions point by point:
- have you tried using in-memory databases rather than test duration?
I have done some testing with in-memory databases, but I get virtually
identical results. This leads me to believe that my bottleneck is
probably not disk I/O.
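(For anyone who wants to reproduce this: the in-memory runs just use
Derby's memory: subprotocol, which is available as of 10.5. A sketch,
with "testdb" as a placeholder database name:

```java
// Sketch: open an in-memory Derby database instead of an on-disk one.
// Requires derby.jar on the classpath; "testdb" is a placeholder name.
Connection conn = DriverManager.getConnection(
        "jdbc:derby:memory:testdb;create=true");
```

Everything else in the test harness stays the same; only the JDBC URL
changes.)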
- have you experimented with using a larger page size to minimize
per-page overhead?
Yes. Right now I am using a page size of 32768, which appears to
offer a modest improvement over 4096. Sizes larger than that don't
seem to have much of an effect.
- have you experimented with giving a larger-than-default page cache?
Yes. Right now I am using a page cache size of 2000. Again, sizes
larger than that don't seem to have much of an effect.
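(For reference, both knobs are ordinary Derby properties; this is
roughly what I have in derby.properties. Note that the page size only
takes effect for tables created after it is set:

```properties
# Page size in bytes; must be set before the table is created.
derby.storage.pageSize=32768
# Number of pages held in the page cache (the default is 1000).
derby.storage.pageCacheSize=2000
```

)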
- is your system CPU-bound? is there any disk io occurring at all?
Yes, it appears to be CPU-bound. I'm not sure how much disk I/O is
really occurring, but we are definitely using up all available CPU
time. Unfortunately, we only use all the CPU time on a single core.
- how much garbage are you generating, what is the GC profile like?
Not too much. Garbage collections take up about 8 seconds over the
course of 5 minutes of work. The heap size is less than the maximum
allocated, so I am not thrashing. (I have had issues where index
generation uses an amazing amount of memory, but that is a story for
another day.)
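(In case it's useful, the GC numbers above came from running with the
standard HotSpot GC logging flags, something along these lines, where
myapp.jar stands in for my actual program:

```
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar myapp.jar
```

)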
- what rev of the JDK are you running, and on what operating system?
Sun JDK 6u16, on Ubuntu 9.10. I am using Derby 10.5.3.0.
- is your system multi-core? Can you throw threads/cores at this problem?
Yes. I have been doing my work on a 1.86 GHz Core 2 and, more
recently, on a 2.8 GHz i7. I haven't had much luck throwing more
threads/cores at the problem, however. I have tried to run some tests
where, instead of just prepping a file into a single table, I run
several threads that each prep the file into a different table.
Unfortunately, I don't actually get more throughput this way, and
while my threads are clearly running on different cores, I still only
end up using a single core's worth of CPU. I assume this means that
somewhere under the hood everything is being serialized.
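(For reference, the multi-table test looked roughly like this. It's a
simplified sketch, not my actual code: the table names, the single
INT column, the row count, and the commit interval are all
placeholders. Each worker gets its own embedded connection:

```java
// Sketch: one embedded connection and one target table per worker thread.
ExecutorService pool = Executors.newFixedThreadPool(2);
for (int t = 0; t < 2; t++) {
    final String table = "MYTABLE" + t;
    pool.submit(new Runnable() {
        public void run() {
            try {
                Connection conn =
                        DriverManager.getConnection("jdbc:derby:testdb");
                conn.setAutoCommit(false);
                PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO " + table + " VALUES (?)");
                for (int i = 0; i < 100000; i++) {
                    ps.setInt(1, i);
                    ps.addBatch();
                    if (i % 1000 == 0) {
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();
                conn.commit();
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    });
}
pool.shutdown();
```

Even with this structure, overall CPU usage stays pinned at one
core's worth.)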
- can you get any sort of profiling data about where the hotspots
are during your run?
Well, YourKit has helped me a bit in this regard. When I am executing
prepared statements in batches, around 90% of the program time is
spent inside the EmbedStatement.executeBatch method. Unfortunately, I
don't know exactly what it is doing in there. Similarly, I can see
that when I am inserting everything using a table function ("INSERT
INTO MYTABLE (...) SELECT S.* FROM TABLE (MYFUNC ()) S"), about 90%
of the program time is spent inside the method EmbedStatement.execute,
of which only 5% is spent getting information from the parser. Again,
I don't know where the rest goes.
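(For completeness, the table function in that test was declared along
these lines. This is a sketch: MYFUNC, the column list, and the class
name are placeholders, but the shape is the standard Derby
DERBY_JDBC_RESULT_SET table-function declaration:

```sql
CREATE FUNCTION MYFUNC ()
RETURNS TABLE (ID INTEGER, VAL VARCHAR(100))
LANGUAGE JAVA
PARAMETER STYLE DERBY_JDBC_RESULT_SET
READS SQL DATA
EXTERNAL NAME 'com.example.MyFunc.read'
```

The Java side is just a public static method that returns a JDBC
ResultSet over the file being prepped.)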
Anyway, if you or anyone else has advice on what else I can do to
improve my performance, I'd love to hear it. Thanks again for your
time.
Cheers,
Nathan
On Wed, Jan 13, 2010 at 12:30 PM, Bryan Pendleton
<[email protected]> wrote:
>> using prepared statements executed in batches, and this gives me
>> between 10 and 20 row inserts per millisecond on average. I have
>
> Wow! You're already going very fast, and you've already put a lot
> of good effort into this, so I congratulate you on how far you've
> gone already.
>
> A couple of thoughts occurred to me as I read your post:
>
> - have you tried using in-memory databases rather than test duration?
> - have you experimented with using a larger page size to minimize
> per-page overhead?
> - have you experimented with giving a larger-than-default page cache?
>
> Also, have you done any system-level analysis of your application
> during the insert:
> - is your system CPU-bound? is there any disk io occurring at all?
> - how much garbage are you generating, what is the GC profile like?
> - what rev of the JDK are you running, and on what operating system?
> - is your system multi-core? Can you throw threads/cores at this problem?
> - can you get any sort of profiling data about where the hotspots
> are during your run?
>
> Good luck with your performance work, it sounds extremely interesting!
>
> thanks,
>
> bryan
>
>