[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877165#comment-16877165 ]

Joshua Mack commented on LUCENE-7745:
-------------------------------------

Sounds good!

Re: efficient histogram implementation in CUDA

If it helps, [this 
approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3]
 has struck a good balance between GPU performance and ease of implementation 
in work I've done in the past. If academic paywalls block all of those results, 
the paper also appears to be available (presumably uploaded by the authors) on 
[ResearchGate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]

The basic idea is to compute a sub-histogram in each thread block, with each 
block accumulating into its on-chip shared memory. Then, when a thread block 
finishes its workload, it atomically adds its result to global memory, which 
cuts the overall amount of traffic to global memory; a rough sketch of that 
structure is below.
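In case it's useful, here is a minimal sketch of that per-block structure (my 
own illustration, not taken from the paper and not tied to Lucene; NUM_BINS, 
the byte-valued input, and the grid-stride loop are assumptions on my part):

{code}
#define NUM_BINS 256

__global__ void blockHistogram(const unsigned char *data, int n,
                               unsigned int *globalHist) {
    // Block-local sub-histogram kept in on-chip shared memory.
    __shared__ unsigned int localHist[NUM_BINS];

    // Zero the sub-histogram cooperatively.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        localHist[i] = 0;
    }
    __syncthreads();

    // Each thread accumulates its grid-stride slice into shared memory.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        atomicAdd(&localHist[data[i]], 1u);
    }
    __syncthreads();

    // One flush per block: merge the sub-histogram into the global result.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        atomicAdd(&globalHist[i], localHist[i]);
    }
}
{code}

The point is that each block issues only NUM_BINS global atomics at the end, 
no matter how many input elements it processed.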

The main contribution, aimed at increasing throughput and reducing 
shared-memory contention, is that they keep R "replicated" sub-histograms in 
each thread block and offset them so that bin 0 of the 1st copy falls into a 
different memory bank than bin 0 of the 2nd copy, and so on for all R copies. 
Essentially, this improves throughput in the degenerate case where many 
threads try to accumulate into the same histogram bin at the same time. One 
layout that realizes this is sketched below.
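One possible way to lay that out (again just my sketch, not the paper's exact 
code; R, NUM_BINS, and the thread-to-replica assignment are assumptions) is to 
interleave the R copies so that bin b of replica r sits at word r + b * R, 
which puts bin 0 of each replica in a different shared-memory bank:

{code}
#define NUM_BINS 256
#define R 8   // number of replicated sub-histograms per block (assumed)

__global__ void replicatedHistogram(const unsigned char *data, int n,
                                    unsigned int *globalHist) {
    // R interleaved copies: bin b of replica r lives at word r + b * R.
    __shared__ unsigned int localHist[NUM_BINS * R];

    for (int i = threadIdx.x; i < NUM_BINS * R; i += blockDim.x) {
        localHist[i] = 0;
    }
    __syncthreads();

    // Each thread sticks to one replica; neighbouring threads use different ones.
    int replica = threadIdx.x % R;
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        atomicAdd(&localHist[replica + (int)data[i] * R], 1u);
    }
    __syncthreads();

    // Reduce the R replicas per bin, then merge into global memory.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int sum = 0;
        for (int r = 0; r < R; ++r) {
            sum += localHist[r + b * R];
        }
        atomicAdd(&globalHist[b], sum);
    }
}
{code}

Because neighbouring threads write to different replicas, even when they all 
hit the same bin the accesses land in different banks; the R copies are 
reduced once per bin before the single atomic add to global memory.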

> Explore GPU acceleration
> ------------------------
>
>                 Key: LUCENE-7745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7745
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>              Labels: gsoc2017, mentor
>         Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that could potentially be sped up if computations 
> were offloaded from the CPU to the GPU(s). With commodity GPUs having as much 
> as 12GB of high-bandwidth RAM, we might be able to leverage GPUs to speed up 
> parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known 
> to be a good candidate for GPU based speedup (esp. when complex polygons are 
> involved). In the past, Mike McCandless has mentioned that "both initial 
> indexing and merging are CPU/IO intensive, but they are very amenable to 
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I 
> volunteer to mentor any GSoC student willing to work on this this summer.


