[
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877934#comment-16877934
]
Rinka Singh edited comment on LUCENE-7745 at 7/3/19 3:56 PM:
-------------------------------------------------------------
{quote}The basic idea is to compute sub-histograms in each thread block with
each thread block accumulating into the local memory. Then, when each thread
block finishes its workload, it atomically adds the result to global memory,
reducing the overall amount of traffic to global memory. To increase throughput
and reduce shared memory contention, the main contribution here is that they
actually use R "replicated" sub-histograms in each thread block, and they
offset them so that bin 0 of the 1st histogram falls into a different memory
bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially,
it improves throughput in the degenerate case where multiple threads are trying
to accumulate the same histogram bin at the same time.
{quote}
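If I'm reading that right, the idea looks roughly like this in CUDA. This is only a minimal sketch under my own assumptions (byte-valued bins, a fixed replica count R, hypothetical kernel/macro names) - not the paper's code and not what I have implemented yet:
{code}
// Minimal sketch of the replicated sub-histogram idea quoted above.
// NUM_BINS, R and the byte-valued input are assumptions for illustration only.
#include <cuda_runtime.h>

#define NUM_BINS 256   // assumed: one bin per byte value
#define R        8     // assumed: replicated sub-histograms per block

__global__ void histogramReplicated(const unsigned char *data, int n,
                                    unsigned int *globalHist)
{
    // R replicas, each padded by one word so that bin 0 of replica r sits in
    // a different shared-memory bank than bin 0 of replica r+1.
    __shared__ unsigned int sHist[R][NUM_BINS + 1];

    for (int i = threadIdx.x; i < R * (NUM_BINS + 1); i += blockDim.x)
        ((unsigned int *)sHist)[i] = 0;
    __syncthreads();

    // Neighbouring threads in a warp use different replicas, so threads that
    // hit the same bin at the same time mostly hit different banks.
    int replica = threadIdx.x % R;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[replica][data[i]], 1u);
    __syncthreads();

    // Fold the replicas together and add this block's sub-histogram to global
    // memory once, so global atomic traffic is per-block, not per-element.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int r = 0; r < R; ++r)
            sum += sHist[r][bin];
        atomicAdd(&globalHist[bin], sum);
    }
}
{code}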
So here's what I've done/am doing:
I have basic histogramming (including stop-word elimination) working on a
single GPU (an old Quadro 2000 with 1 GB of memory). I've tested it on a
5 MB text file and it seems to be working OK.
Briefly, here's how I'm implementing it:
* Read a file from the command line (Linux executable) into the GPU
* convert the stream to words, chunk them into blocks
* eliminate the stop words
* sort/merge (including word counts) everything, first inside a block and then
across blocks - I came up with my own sort and haven't had time to explore
the parallel sorts out there (a sketch of this step with standard CUDA
primitives follows this list)
* The result is a sorted histogram held across multiple blocks in GPU memory.
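To make the sort/merge step concrete: here is a hedged sketch of the same idea (sort the word ids, then collapse equal ids into counts) using Thrust, which ships with the CUDA toolkit. This is not my custom sort - just standard parallel primitives - and the word ids are assumed to have already been produced by tokenization and stop-word removal:
{code}
// Sketch of "sort, then merge equal words into counts" using Thrust,
// NOT my custom sort. Assumes words were already tokenized, stop-word
// filtered, and mapped to integer ids on the device.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <vector>
#include <iostream>

int main()
{
    // Hypothetical stream of word ids after tokenization/stop-word removal.
    std::vector<int> h_ids = {3, 1, 3, 7, 1, 3, 9, 7};
    thrust::device_vector<int> wordIds(h_ids.begin(), h_ids.end());

    // 1. Sort so that equal word ids become adjacent.
    thrust::sort(wordIds.begin(), wordIds.end());

    // 2. Collapse adjacent equal ids into (id, count) pairs - the
    //    "merge including word-count" step of the histogram.
    thrust::device_vector<int> uniqueIds(wordIds.size());
    thrust::device_vector<int> counts(wordIds.size());
    auto ends = thrust::reduce_by_key(wordIds.begin(), wordIds.end(),
                                      thrust::make_constant_iterator(1),
                                      uniqueIds.begin(), counts.begin());

    size_t n = ends.first - uniqueIds.begin();
    for (size_t i = 0; i < n; ++i)
        std::cout << "word " << uniqueIds[i]
                  << " -> count " << counts[i] << "\n";
    return 0;
}
{code}
If nothing else, something like this gives me a baseline to compare my own sort against.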
The advantages of this approach (to my mind) are:
* I can scale up to use the entire GPU memory. My guess is I can create and
manage an 8-10 GB index on a V100 (it has 32 GB) - like I said, I've only tested
with a 5 MB text file so far.
* It's easy to add fresh data to the existing histogram. All I need to do is
create new blocks and sort/merge them in (see the merge sketch after this list).
* I'm guessing this should also make it easy to scale across GPUs, which means
on a multi-GPU machine I can scale to almost the number of GPUs there, and then
of course one could set up a cluster of such machines... This is far in the
future though.
* The sort is kept separate so we can experiment with various sorts and see
which one performs best.
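Here is the kind of thing I mean by merging fresh blocks into the existing histogram - again sketched with Thrust rather than my own sort/merge, with made-up word ids: merge the two sorted (word id, count) runs by key, then sum the counts of ids that appear in both.
{code}
// Hedged sketch of folding a freshly histogrammed block into an existing
// sorted histogram, using Thrust rather than my own sort/merge. Both inputs
// are assumed to already be sorted (word id, count) runs.
#include <thrust/device_vector.h>
#include <thrust/merge.h>
#include <thrust/reduce.h>
#include <vector>
#include <iostream>

int main()
{
    // Existing histogram and a new block, both sorted by word id (made up).
    std::vector<int> oldIds = {1, 3, 7, 9}, oldCnt = {2, 3, 2, 1};
    std::vector<int> newIds = {3, 4, 9},    newCnt = {1, 5, 2};

    thrust::device_vector<int> aIds(oldIds.begin(), oldIds.end());
    thrust::device_vector<int> aCnt(oldCnt.begin(), oldCnt.end());
    thrust::device_vector<int> bIds(newIds.begin(), newIds.end());
    thrust::device_vector<int> bCnt(newCnt.begin(), newCnt.end());

    // 1. Merge the two sorted runs, keeping each count paired with its id.
    thrust::device_vector<int> mIds(aIds.size() + bIds.size());
    thrust::device_vector<int> mCnt(mIds.size());
    thrust::merge_by_key(aIds.begin(), aIds.end(),
                         bIds.begin(), bIds.end(),
                         aCnt.begin(), bCnt.begin(),
                         mIds.begin(), mCnt.begin());

    // 2. Sum the counts of word ids that now appear in both runs.
    thrust::device_vector<int> outIds(mIds.size());
    thrust::device_vector<int> outCnt(mIds.size());
    auto ends = thrust::reduce_by_key(mIds.begin(), mIds.end(), mCnt.begin(),
                                      outIds.begin(), outCnt.begin());

    for (size_t i = 0; i < size_t(ends.first - outIds.begin()); ++i)
        std::cout << "word " << outIds[i] << " -> " << outCnt[i] << "\n";
    return 0;
}
{code}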
The issues are:
* It is currently horrendously slow (I use global memory all the way, with no
optimization) - much too slow for my liking (I went over to NVIDIA's office and
tested it on a K80, and it was just twice as fast as on my GPU). I'm currently
implementing a shared-memory version (and a few other tweaks) that should speed
it up.
* I have yet to compare it with the histogramming tools out there, so I can't
say how it stacks up. Once I have the basic inverted index in place, I'll reach
out to you all for testing.
* It is still a bit fragile - I'm still finding bugs as I test, but the basics
work.
Currently in progress:
* The code has been modified for (some) performance; I'm debugging/testing, and
it will take a while. As of now I feel good about what I've done, but I won't
know until I get it working and test for performance.
* Need to add the ability to handle multiple files (I'll probably postpone this,
since one can always cat the files together and pass the result in - that's a
pretty simple script wrapped around the executable).
* Need to create inverted index.
* We'll worry about searching the index later, but that should be pretty
trivial - well, actually nothing is trivial here.
{quote}Re: efficient histogram implementation in CUDA
If it helps, [this
approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3]
has been good for a balance between GPU performance and ease of implementation
for work I've done in the past. If academic paywalls block you for all those
results, it looks to also be available (presumably by the authors) on
[researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]
{quote}
Took a quick look - those results are all paywalled. I will take a look at the
ResearchGate copy sometime.
I apologize, but I may not be very responsive over the next month or so, as we
are in the middle of a release at work, on top of my night-time job (this).
> Explore GPU acceleration
> ------------------------
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known
> to be a good candidate for GPU based speedup (esp. when complex polygons are
> involved). In the past, Mike McCandless has mentioned that "both initial
> indexing and merging are CPU/IO intensive, but they are very amenable to
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I
> volunteer to mentor any GSoC student willing to work on this, this summer.