[
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877934#comment-16877934
]
Rinka Singh edited comment on LUCENE-7745 at 7/3/19 3:56 PM:
-------------------------------------------------------------
{quote}The basic idea is to compute sub-histograms in each thread block with
each thread block accumulating into the local memory. Then, when each thread
block finishes its workload, it atomically adds the result to global memory,
reducing the overall amount of traffic to global memory. To increase throughput
and reduce shared memory contention, the main contribution here is that they
actually use R "replicated" sub-histograms in each thread block, and they
offset them so that bin 0 of the 1st histogram falls into a different memory
bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially,
it improves throughput in the degenerate case where multiple threads are trying
to accumulate the same histogram bin at the same time.
{quote}
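If I'm reading that right, the idea looks roughly like this in CUDA. This is only a minimal sketch under my own assumptions (byte-valued bins, a fixed replica count R, hypothetical kernel/macro names) - not the paper's code and not what I have implemented yet:
{code}
// Minimal sketch of the replicated sub-histogram idea quoted above.
// NUM_BINS, R and the byte-valued input are assumptions for illustration only.
#include <cuda_runtime.h>

#define NUM_BINS 256   // assumed: one bin per byte value
#define R        8     // assumed: replicated sub-histograms per block

__global__ void histogramReplicated(const unsigned char *data, int n,
                                    unsigned int *globalHist)
{
    // R replicas, each padded by one word so that bin 0 of replica r sits in
    // a different shared-memory bank than bin 0 of replica r+1.
    __shared__ unsigned int sHist[R][NUM_BINS + 1];

    for (int i = threadIdx.x; i < R * (NUM_BINS + 1); i += blockDim.x)
        ((unsigned int *)sHist)[i] = 0;
    __syncthreads();

    // Neighbouring threads in a warp use different replicas, so threads that
    // hit the same bin at the same time mostly hit different banks.
    int replica = threadIdx.x % R;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[replica][data[i]], 1u);
    __syncthreads();

    // Fold the replicas together and add this block's sub-histogram to global
    // memory once, so global atomic traffic is per-block, not per-element.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int r = 0; r < R; ++r)
            sum += sHist[r][bin];
        atomicAdd(&globalHist[bin], sum);
    }
}
{code}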
So here's what I've done/am doing:
I have basic histogramming (including stop-word elimination) working on a
single GPU (an old Quadro 2000 with 1 GB of memory). I've tested it on a
5 MB text file and it seems to be working OK.
Briefly, here's how I'm implementing it:
* Read a file from the command line (Linux executable) into the GPU
* convert the stream to words, chunk them into blocks
* eliminate the stop words
* sort/merge (including word counts) everything, first inside a block and then
across blocks - I came up with my own sort and haven't had time to explore
the parallel sorts out there (a sketch of this step with standard CUDA
primitives follows this list)
* The result is a sorted histogram held across multiple blocks in GPU memory.
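To make the sort/merge step concrete: here is a hedged sketch of the same idea (sort the word ids, then collapse equal ids into counts) using Thrust, which ships with the CUDA toolkit. This is not my custom sort - just standard parallel primitives - and the word ids are assumed to have already been produced by tokenization and stop-word removal:
{code}
// Sketch of "sort, then merge equal words into counts" using Thrust,
// NOT my custom sort. Assumes words were already tokenized, stop-word
// filtered, and mapped to integer ids on the device.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <vector>
#include <iostream>

int main()
{
    // Hypothetical stream of word ids after tokenization/stop-word removal.
    std::vector<int> h_ids = {3, 1, 3, 7, 1, 3, 9, 7};
    thrust::device_vector<int> wordIds(h_ids.begin(), h_ids.end());

    // 1. Sort so that equal word ids become adjacent.
    thrust::sort(wordIds.begin(), wordIds.end());

    // 2. Collapse adjacent equal ids into (id, count) pairs - the
    //    "merge including word-count" step of the histogram.
    thrust::device_vector<int> uniqueIds(wordIds.size());
    thrust::device_vector<int> counts(wordIds.size());
    auto ends = thrust::reduce_by_key(wordIds.begin(), wordIds.end(),
                                      thrust::make_constant_iterator(1),
                                      uniqueIds.begin(), counts.begin());

    size_t n = ends.first - uniqueIds.begin();
    for (size_t i = 0; i < n; ++i)
        std::cout << "word " << uniqueIds[i]
                  << " -> count " << counts[i] << "\n";
    return 0;
}
{code}
If nothing else, something like this gives me a baseline to compare my own sort against.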
The advantages of this approach (to my mind) are:
* I can scale up to use the entire GPU memory. My guess is I can create and
manage an 8-10 GB index on a V100 (it has 32 GB) - like I said, I've only tested
with a 5 MB text file so far.
* It's easy to add fresh data to the existing histogram. All I need to do is
create new blocks and sort/merge them in (see the merge sketch after this list).
* I'm guessing this should also make it easy to scale across GPUs, which means
on a multi-GPU machine I can scale to almost the number of GPUs there, and then
of course one could set up a cluster of such machines... This is far in the
future though.
* The sort is kept separate so we can experiment with various sorts and see
which one performs best.
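Here is the kind of thing I mean by merging fresh blocks into the existing histogram - again sketched with Thrust rather than my own sort/merge, with made-up word ids: merge the two sorted (word id, count) runs by key, then sum the counts of ids that appear in both.
{code}
// Hedged sketch of folding a freshly histogrammed block into an existing
// sorted histogram, using Thrust rather than my own sort/merge. Both inputs
// are assumed to already be sorted (word id, count) runs.
#include <thrust/device_vector.h>
#include <thrust/merge.h>
#include <thrust/reduce.h>
#include <vector>
#include <iostream>

int main()
{
    // Existing histogram and a new block, both sorted by word id (made up).
    std::vector<int> oldIds = {1, 3, 7, 9}, oldCnt = {2, 3, 2, 1};
    std::vector<int> newIds = {3, 4, 9},    newCnt = {1, 5, 2};

    thrust::device_vector<int> aIds(oldIds.begin(), oldIds.end());
    thrust::device_vector<int> aCnt(oldCnt.begin(), oldCnt.end());
    thrust::device_vector<int> bIds(newIds.begin(), newIds.end());
    thrust::device_vector<int> bCnt(newCnt.begin(), newCnt.end());

    // 1. Merge the two sorted runs, keeping each count paired with its id.
    thrust::device_vector<int> mIds(aIds.size() + bIds.size());
    thrust::device_vector<int> mCnt(mIds.size());
    thrust::merge_by_key(aIds.begin(), aIds.end(),
                         bIds.begin(), bIds.end(),
                         aCnt.begin(), bCnt.begin(),
                         mIds.begin(), mCnt.begin());

    // 2. Sum the counts of word ids that now appear in both runs.
    thrust::device_vector<int> outIds(mIds.size());
    thrust::device_vector<int> outCnt(mIds.size());
    auto ends = thrust::reduce_by_key(mIds.begin(), mIds.end(), mCnt.begin(),
                                      outIds.begin(), outCnt.begin());

    for (size_t i = 0; i < size_t(ends.first - outIds.begin()); ++i)
        std::cout << "word " << outIds[i] << " -> " << outCnt[i] << "\n";
    return 0;
}
{code}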
The issues are:
* It is currently horrendously slow (I use global memory all the way, with no
optimization) - much too slow for my liking (I went over to NVIDIA's office and
tested it on a K80, and it was just twice as fast as on my GPU). I'm currently
implementing a shared-memory version (and a few other tweaks) that should speed
it up.
* I have yet to compare it with the histogramming tools out there, so I can't
say how it stacks up. Once I have the basic inverted index in place, I'll reach
out to you all for testing.
* It is still a bit fragile - I'm still finding bugs as I test, but the basics
work.
Currently in progress:
* The code has been modified for (some) performance; I'm debugging/testing, and
it will take a while. As of now I feel good about what I've done, but I won't
know until I get it working and test for performance.
* Need to add the ability to handle multiple files (I'll probably postpone this,
since one can always cat the files together and pass the result in - that's a
pretty simple script wrapped around the executable).
* Need to create inverted index.
* We'll worry about searching the index later, but that should be pretty
trivial - well, actually nothing is trivial here.
{quote}Re: efficient histogram implementation in CUDA
If it helps, [this
approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3]
has been good for a balance between GPU performance and ease of implementation
for work I've done in the past. If academic paywalls block you for all those
results, it looks to also be available (presumably by the authors) on
[researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]
{quote}
Took a quick look - those results are all paywalled. I will take a look at the
ResearchGate copy sometime.
I apologize, but I may not be very responsive over the next month or so, as we
are in the middle of a release at work, on top of my night-time job (this).
> Explore GPU acceleration
> ------------------------
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known
> to be a good candidate for GPU based speedup (esp. when complex polygons are
> involved). In the past, Mike McCandless has mentioned that "both initial
> indexing and merging are CPU/IO intensive, but they are very amenable to
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I
> volunteer to mentor any GSoC student willing to work on this, this summer.