[
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863938#comment-16863938
]
Rinka Singh edited comment on LUCENE-7745 at 6/19/19 10:34 AM:
---------------------------------------------------------------
Hi [~mackncheesiest], All,
A quick update. I've been going really slow (sorry 'bout that); my day job has
consumed a lot of my time. What I do have working is histogramming (on text
files) on GPUs. The problem is that it is horrendously slow - I use GPU global
memory all the way through (so it is only about 4-5 times faster than the CPU)
instead of doing the work in local/shared memory. I've been trying to
accelerate that before converting it into an inverted index. Nearly there :) -
you know how that is, the almost-there syndrome... Once I get it done, I'll
check it into my GitHub. A rough sketch of the shared-memory direction I'm
aiming for is just below.
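To make the global-versus-shared-memory point concrete, here is a minimal sketch of a 256-bin byte histogram done both ways. This is not my actual code - the bin count, launch sizes and names are purely illustrative - but it shows the idea: the naive kernel hits contended global-memory atomics for every input byte, while the privatized kernel accumulates per-block counts in shared memory and flushes once at the end.
{code}
// Minimal sketch: 256-bin byte histogram, global-memory vs shared-memory atomics.
// Illustrative only - assumes the bin count fits comfortably in shared memory.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256

// Naive version: every thread hammers the same global counters.
__global__ void histGlobal(const unsigned char *data, size_t n, unsigned int *bins) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);              // contended global atomics
}

// Privatized version: each block accumulates into shared memory,
// then flushes one partial histogram per block.
__global__ void histShared(const unsigned char *data, size_t n, unsigned int *bins) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);             // cheap shared-memory atomics

    __syncthreads();
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);              // one flush per block per bin
}

int main() {
    const size_t n = 1 << 24;                       // ~16M bytes of synthetic data
    std::vector<unsigned char> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = (unsigned char)(i * 31u);

    unsigned char *d_data; unsigned int *d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histShared<<<256, 256>>>(d_data, n, d_bins);    // swap in histGlobal to compare
    cudaDeviceSynchronize();

    unsigned int bins[NUM_BINS];
    cudaMemcpy(bins, d_bins, sizeof(bins), cudaMemcpyDeviceToHost);
    printf("bin[0] = %u\n", bins[0]);
    cudaFree(d_data); cudaFree(d_bins);
    return 0;
}
{code}
The win is that the contended atomics move from DRAM onto the chip; the only global traffic left is NUM_BINS updates per block.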
Here are the lessons I've learned on this journey:
# Do all the decision making on the CPU. See whether parallelization can
substitute for decision making - you need to treat parallelization/optimization
as part of the design, not as an afterthought, which runs counter to what
everyone says about optimization. The reason is that there can be SIGNIFICANT
design changes. (See the first sketch after this list.)
# Go for parallelism that is as fine-grained as possible. Don't think one
CPU thread == one GPU thread; think as parallel as possible.
# The best metaphor I've found for working with GPUs: think of it as an
embedded board attached to your machine - you move data to and from the board
and debug on the board. Dump all the parallel processing on the board and keep
the sequential work on the CPU.
# Read the NVIDIA manuals (they are your best bet). I figured it is better to
stay with CUDA (as opposed to OpenCL) given the wealth of CUDA information and
support out there...
# Writing code:
## Explicitly think about cache memory (that's your shared memory) and
registers (local memory) and manage them yourself. This is COMPLETELY different
from writing CPU code, where the compiler does this for you.
## Try to use constant, shared and register memory as much as possible, and
avoid __syncthreads() if you can - see the second sketch after this list.
## Here's where the dragons lie...
# Engineer productivity is roughly 1/10th of normal (and I mean writing C/C++,
not Python). I've written and thrown away code umpteen times - something I just
wouldn't need to do when writing standard code.
# Added on 6/19: As a metaphor, think of this as VHDL/Verilog programming
(without the timing constructs - timing is not a major issue here, since the
threads on the device execute in (almost) lockstep).
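On lesson 1, here is a toy example of what I mean by letting parallelism substitute for per-element decision making. It is made up purely for illustration (it is not from my histogram code): the branchy clamp makes threads within a warp diverge, while the second version expresses the same decision as arithmetic, so every thread runs the same path.
{code}
// Toy sketch for lesson 1: the same clamp written with branches and without.
// The names and the clamp operation itself are invented for illustration.
#include <cuda_runtime.h>

// Divergent version: threads in one warp can take different branches,
// so the warp serializes both paths.
__global__ void clampBranchy(float *v, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (v[i] < lo)      v[i] = lo;
    else if (v[i] > hi) v[i] = hi;
}

// Branch-light version: the decision becomes arithmetic (fminf/fmaxf map to
// predicated min/max instructions), and every thread executes the same code.
__global__ void clampSelect(float *v, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    v[i] = fminf(fmaxf(v[i], lo), hi);
}
{code}
The remaining decisions - which kernel to launch, the grid/block sizes, the lo/hi values - stay on the host, which is the "decide on the CPU" part.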
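And on lesson 5, a small stencil-style sketch (again illustrative only - it assumes the host fills c_coeff with cudaMemcpyToSymbol and launches with a block size of TILE) showing the memory kinds working together: read-only coefficients in __constant__ memory, a staged tile in shared memory, the running sum in a register, and exactly one __syncthreads() - the one the staged data actually requires.
{code}
// Toy sketch for lesson 5: a 1-D stencil that uses constant, shared and
// register memory explicitly. Launch with blockDim.x == TILE.
#include <cuda_runtime.h>

#define RADIUS 3
#define TILE   256

__constant__ float c_coeff[2 * RADIUS + 1];       // read-only, cached, broadcast

__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];
    int gi = blockIdx.x * blockDim.x + threadIdx.x;
    int li = threadIdx.x + RADIUS;

    // Stage this block's slice (plus halo) into shared memory.
    tile[li] = (gi < n) ? in[gi] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gi - RADIUS;
        int right = gi + TILE;
        tile[li - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[li + TILE]   = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                              // the one sync the staged data needs

    if (gi >= n) return;
    float acc = 0.0f;                             // lives in a register
    #pragma unroll
    for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += c_coeff[k + RADIUS] * tile[li + k];
    out[gi] = acc;
}
{code}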
Having said all this, :) I have a bunch of limitations that a regular software
engineer will not have, and I have been struggling to get over them. I've been
a manager for way too long and find it really difficult to focus on just one
thing (the standard ADHD that most managers eventually develop). Also, I WAS a
C programmer, loong ago - no, not even C++ - and I just haven't had the
bandwidth to pick up C++, and let's not even talk about my day-job pressures -
I do this for an hour or two at night (sigh)...
I will put out everything I've done once I've crossed a milestone (a working
accelerated histogram). Then I will modify that to do inverted indexing.
Hope this helps... In the meantime, if you want me to review
documents/design/thoughts/anything, please feel free to mail them to me at:
rinka (dot) singh (at) gmail..... At least ping me - I really don't look at
the Apache messages and would probably miss something...
Sorry, did I mention this - I'm a COMPLETE noob at contributing to open source,
so please forgive me if I get the processes and systems wrong... I'm trying to
learn.
> Explore GPU acceleration
> ------------------------
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be sped up if computations
> were to be offloaded from the CPU to the GPU(s). With commodity GPUs having as
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to
> speed up parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known
> to be a good candidate for GPU based speedup (esp. when complex polygons are
> involved). In the past, Mike McCandless has mentioned that "both initial
> indexing and merging are CPU/IO intensive, but they are very amenable to
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I
> volunteer to mentor any GSoC student willing to work on this, this summer.