[
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863938#comment-16863938
]
Rinka Singh edited comment on LUCENE-7745 at 6/19/19 10:34 AM:
---------------------------------------------------------------
Hi [~mackncheesiest], All,
A quick update. I've been going really slow (sorry 'bout that); my day job has
consumed a lot of my time. What I do have working is histogramming (on text
files) on GPUs. The problem is that it is horrendously slow - I use GPU global
memory all the way through (so it is only about 4-5 times faster than the CPU)
instead of doing the work in local/shared memory. I've been trying to
accelerate that before converting it into an inverted index. Nearly there :) -
you know how that is, the almost-there syndrome... Once I get it done, I'll
check it into my GitHub. A rough sketch of the shared-memory direction I'm
aiming for is just below.
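To make the global-versus-shared-memory point concrete, here is a minimal sketch of a 256-bin byte histogram done both ways. This is not my actual code - the bin count, launch sizes and names are purely illustrative - but it shows the idea: the naive kernel hits contended global-memory atomics for every input byte, while the privatized kernel accumulates per-block counts in shared memory and flushes once at the end.
{code}
// Minimal sketch: 256-bin byte histogram, global-memory vs shared-memory atomics.
// Illustrative only - assumes the bin count fits comfortably in shared memory.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256

// Naive version: every thread hammers the same global counters.
__global__ void histGlobal(const unsigned char *data, size_t n, unsigned int *bins) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);              // contended global atomics
}

// Privatized version: each block accumulates into shared memory,
// then flushes one partial histogram per block.
__global__ void histShared(const unsigned char *data, size_t n, unsigned int *bins) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);             // cheap shared-memory atomics

    __syncthreads();
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);              // one flush per block per bin
}

int main() {
    const size_t n = 1 << 24;                       // ~16M bytes of synthetic data
    std::vector<unsigned char> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = (unsigned char)(i * 31u);

    unsigned char *d_data; unsigned int *d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histShared<<<256, 256>>>(d_data, n, d_bins);    // swap in histGlobal to compare
    cudaDeviceSynchronize();

    unsigned int bins[NUM_BINS];
    cudaMemcpy(bins, d_bins, sizeof(bins), cudaMemcpyDeviceToHost);
    printf("bin[0] = %u\n", bins[0]);
    cudaFree(d_data); cudaFree(d_bins);
    return 0;
}
{code}
The win is that the contended atomics move from DRAM onto the chip; the only global traffic left is NUM_BINS updates per block.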
Here are the lessons I've learned on this journey:
# Do all the decision making on the CPU. See whether parallelization can
substitute for decision making - you need to treat parallelization/optimization
as part of the design, not as an afterthought, which runs counter to what
everyone says about optimization. The reason is that there can be SIGNIFICANT
design changes. (See the first sketch after this list.)
# Go for parallelism that is as fine-grained as possible. Don't think one
CPU thread == one GPU thread; think as parallel as possible.
# The best metaphor I've found for working with GPUs: think of it as an
embedded board attached to your machine - you move data to and from the board
and debug on the board. Dump all the parallel processing on the board and keep
the sequential work on the CPU.
# Read the NVIDIA manuals (they are your best bet). I figured it is better to
stay with CUDA (as opposed to OpenCL) given the wealth of CUDA information and
support out there...
# Writing code:
## Explicitly think about cache memory (that's your shared memory) and
registers (local memory) and manage them yourself. This is COMPLETELY different
from writing CPU code, where the compiler does this for you.
## Try to use constant, shared and register memory as much as possible, and
avoid __syncthreads() if you can - see the second sketch after this list.
## Here's where the dragons lie...
# Engineer productivity is roughly 1/10th of normal (and I mean writing C/C++,
not Python). I've written and thrown away code umpteen times - something I just
wouldn't need to do when writing standard code.
# Added on 6/19: As a metaphor, think of this as VHDL/Verilog programming
(without the timing constructs - timing is not a major issue here, since the
threads on the device execute in (almost) lockstep).
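On lesson 1, here is a toy example of what I mean by letting parallelism substitute for per-element decision making. It is made up purely for illustration (it is not from my histogram code): the branchy clamp makes threads within a warp diverge, while the second version expresses the same decision as arithmetic, so every thread runs the same path.
{code}
// Toy sketch for lesson 1: the same clamp written with branches and without.
// The names and the clamp operation itself are invented for illustration.
#include <cuda_runtime.h>

// Divergent version: threads in one warp can take different branches,
// so the warp serializes both paths.
__global__ void clampBranchy(float *v, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (v[i] < lo)      v[i] = lo;
    else if (v[i] > hi) v[i] = hi;
}

// Branch-light version: the decision becomes arithmetic (fminf/fmaxf map to
// predicated min/max instructions), and every thread executes the same code.
__global__ void clampSelect(float *v, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    v[i] = fminf(fmaxf(v[i], lo), hi);
}
{code}
The remaining decisions - which kernel to launch, the grid/block sizes, the lo/hi values - stay on the host, which is the "decide on the CPU" part.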
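And on lesson 5, a small stencil-style sketch (again illustrative only - it assumes the host fills c_coeff with cudaMemcpyToSymbol and launches with a block size of TILE) showing the memory kinds working together: read-only coefficients in __constant__ memory, a staged tile in shared memory, the running sum in a register, and exactly one __syncthreads() - the one the staged data actually requires.
{code}
// Toy sketch for lesson 5: a 1-D stencil that uses constant, shared and
// register memory explicitly. Launch with blockDim.x == TILE.
#include <cuda_runtime.h>

#define RADIUS 3
#define TILE   256

__constant__ float c_coeff[2 * RADIUS + 1];       // read-only, cached, broadcast

__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];
    int gi = blockIdx.x * blockDim.x + threadIdx.x;
    int li = threadIdx.x + RADIUS;

    // Stage this block's slice (plus halo) into shared memory.
    tile[li] = (gi < n) ? in[gi] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gi - RADIUS;
        int right = gi + TILE;
        tile[li - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[li + TILE]   = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                              // the one sync the staged data needs

    if (gi >= n) return;
    float acc = 0.0f;                             // lives in a register
    #pragma unroll
    for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += c_coeff[k + RADIUS] * tile[li + k];
    out[gi] = acc;
}
{code}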
Having said all this, :) I have a bunch of limitations that a regular software
engineer will not have, and I have been struggling to get over them. I've been
a manager for way too long and find it really difficult to focus on just one
thing (the standard ADHD that most managers eventually develop). Also, I WAS a
C programmer, loong ago - no, not even C++ - and I just haven't had the
bandwidth to pick up C++, and let's not even talk about my day-job pressures -
I do this for an hour or two at night (sigh)...
I will put out everything I've done once I've crossed a milestone (a working
accelerated histogram). Then I will modify that to do inverted indexing.
Hope this helps... In the meantime, if you want me to review
documents/design/thoughts/anything, please feel free to mail them to me at:
rinka (dot) singh (at) gmail..... At least ping me - I really don't look at
the Apache messages and would probably miss something...
Sorry, did I mention this - I'm a COMPLETE noob at contributing to open source,
so please forgive me if I get the processes and systems wrong... I'm trying to
learn.
> Explore GPU acceleration
> ------------------------
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be sped up if computations
> were to be offloaded from the CPU to the GPU(s). With commodity GPUs having as
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to
> speed up parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known
> to be a good candidate for GPU based speedup (esp. when complex polygons are
> involved). In the past, Mike McCandless has mentioned that "both initial
> indexing and merging are CPU/IO intensive, but they are very amenable to
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I
> volunteer to mentor any GSoC student willing to work on this, this summer.