Hi Brian - Re. HDFS, are you aware that Chapel already has some (experimental) HDFS support? See http://chapel.cray.com/docs/latest/modules/standard/HDFS.html
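For anyone following along, a very rough sketch of what reading a file through that module might look like. The calls here (hdfsChapelConnect, hdfsOpen, hdfsClose, hdfsChapelDisconnect) are my best recollection of the experimental API, so check the linked docs rather than trusting this:

```chapel
// Hedged sketch only: the experimental HDFS API may differ from this.
use HDFS;

var fs = hdfsChapelConnect("default", 0);         // namenode, port (assumed defaults)
var f  = fs.hdfsOpen("/user/brian/words.txt", iomode.r);
var reader = f.reader();
var word: string;
while reader.readln(word) do                      // one line at a time
  writeln(word);
f.hdfsClose();
fs.hdfsChapelDisconnect();
```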
That might make it easier to do the I/O patterns you were hoping for... In
addition, I *think* you might be able to get HDFSiter to read lines (if you
use a regular expression matching a line). We do have (again experimental)
forall support for HDFSiters.

-michael

On 5/19/15, 9:45 AM, "Brian Guarraci" <[email protected]> wrote:

>answers / comments inline
>
>On Fri, May 15, 2015 at 8:54 AM, Michael Ferguson
><[email protected]> wrote:
>
>Hi Brian -
>
>See below for lots of feedback on the program you posted...
>
>On 5/14/15, 1:19 PM, "Brian Guarraci" <[email protected]> wrote:
>
>>I've managed to create a simple example which seems to encapsulate the
>>general, but ultra-simplified, access patterns that my search index is
>>doing. The example runs about as slowly as the full program and is
>>basically not doing much other than allocating nodes on a linked list
>>and incrementing an atomic. I'd really love to know what naive thing
>>I'm doing that turns out to be really slow - and what I can do about it.
>>
>>https://github.com/briangu/libev-examples/blob/_bg_p2/chapel/crosstalk.chpl
>
>I'd like to try running this on some other systems to understand whether
>they have a similar performance issue. Are you using the GASNet UDP
>conduit? Do you have a runnable version of the simplified program? Also,
>if you kept the same skeleton of the program but changed the type of
>word to int - do you still see performance problems?
>
>I've tried reverting to native ints and uints in all experiments. No
>difference. I too was curious whether the ARM processor was having
>issues with the width.
>
>If not, the problems might mostly stem from the string implementation,
>which we are actively working on... (You could also try using tuples
>and/or records to represent fixed-length strings as a sequence of
>bytes/uint(8)s.)
>
>That's interesting. I will give it a try.
>
>As it is, I believe this program will leak memory in strings because of
>problems that we're working on fixing.
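To make the tuples/records suggestion concrete, here is a rough, untested sketch of a fixed-length word stored as a record of bytes; FixedWord and makeFixedWord are invented names for illustration, not an existing API:

```chapel
// Untested sketch of a fixed-length "string" as a record of bytes.
record FixedWord {
  param maxLen = 16;             // fixed capacity; bytes beyond it are dropped
  var len: int;
  var data: maxLen*uint(8);      // homogeneous tuple of bytes
}

proc makeFixedWord(s: string): FixedWord {
  var w: FixedWord;
  w.len = min(s.length, w.maxLen);
  for i in 1..w.len do
    w.data(i) = ascii(s[i]):uint(8);   // byte value of the i-th character
  return w;
}
```

Records like this are value types, so passing one in an 'on' statement should move a fixed-size blob of bytes rather than a pointer to string data.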
>Got it. Thanks!
>
>Also, I'm a little bit surprised that this program is totally serial.
>We have some draft work to do 'forall file.lines()'. Would that be useful
>to you? On a related point - why are you using atomic ints if the program
>is totally serial? Was the full program working in parallel? Or were
>you working up to a parallel version?
>
>As I mentioned above, at this point I have reduced everything down to
>the simplest form in an effort to clarify the slowness. The ideal state
>is that reads are parallelized, enqueuing of indexing is multi-threaded,
>and writers are single-threaded on a per-partition basis. The search
>system I'm most familiar with uses a single writer and multiple readers
>and still can easily index 10s of thousands of records per second per
>partition.
>
>In addition, I'm thinking about making the I/O system support:
> * efficiently reading from a remote file
> * creating a channel to queue up work between two or more locales
>Would those features help?
>
>Yes, I think so.
>
>Using HDFS is my current plan, but having the ability to read a locale's
>local file from another locale would be awesome. Personally, I'd use a
>URI scheme to address it. Maybe use locale as the scheme? Where
>"locale://0/foo.txt" could read foo.txt from the CWD of locale 0. I
>suppose you could also address it in the open file command as well,
>removing the need to synthesize strings.
>
>I think a generic pub-sub system for distributing tasks would be very
>useful. It may also be more useful if you allow customized hash
>partitioning functions, so that there's basically an implied partitioning
>scheme and consumers could subscribe to one or more virtual queues. This
>would also allow for replicas on a hash partition. More generically, I
>think it would be very useful for the publisher to have a queue of the
>responses (possibly the same type of system with multiple consumers).
>I could imagine taking a TCP request, publishing the work item, getting
>the result, and then responding to the TCP request with the response. I
>could also imagine more standard use cases like fanning out image
>processing.
>
>I also happen to know a fair amount about information retrieval...
>so I think I have an idea what you're trying to do here. I'd be
>happy to discuss with you other ways it might make sense to
>make a distributed indexer (in terms of overall program architecture).
>But, I won't start that discussion unless you opt in, since it's
>not what you were asking about...
>
>Sure! What I've done so far is pretty basic so I'm curious what you
>think would work best in your experience.
>
>As a pedantic exercise, I'm going to go through your little program
>(with some hints from your other github code) and point out which
>communication is occurring where. I'm not saying that you do or
>don't already know these details. Likewise, I'm not saying that the
>resulting performance is where we would like it to be. Just trying
>to help you (and anybody else reading this) understand what's going
>on with the current implementation. For that reason, I'll probably
>answer questions you didn't mean to ask...
>> use Logging, Memory, IO, Partitions, Time;
>>
>> class Node {
>>   var word: string;
>>   var next: Node;
>> }
>>
>> class PartitionInfo {
>>   var head: Node;
>>   var count: atomic int;
>> }
>>
>> class WordIndex {
>>   var wordIndex: [0..Partitions.size-1] PartitionInfo;
>>
>>   proc WordIndex() {
>>     for i in wordIndex.domain {
>>       on Partitions[i] {
>>         writeln("adding ", i);
>>         wordIndex[i] = new PartitionInfo();
>>       }
>>     }
>>   }
>>
>>   proc indexWord(word: string) {
>>     var partition = partitionForWord(word);
>>     var info = wordIndex[partition];
>
>These should not cause communication because you're still running
>everything on Locale 0 at this point, and the WordIndex class instance
>as well as its wordIndex array member are all also on Locale 0.
>
>Now info is a PartitionInfo class instance.
>
>In WordIndex(), you allocate each PartitionInfo object on a different
>node, so this on statement should actually do something useful.
>Furthermore, the 'head' and 'count' fields should be local at that
>point, but the compiler might still generate code that could
>communicate... This is a case where the 'local field' pragma Brad
>brought up before might help create more optimal code.
>
>>     on info {
>
>Note that the body of this 'on' statement uses two variables from the
>function: info and word.
>
>Remote value forwarding (an optimization within the Chapel compiler)
>should pass info (a pointer to a class object) along with the message
>starting the body of this 'on' statement. If that optimization is
>disabled, the generated code would probably communicate back to Locale 0
>to read the value of the variable 'info' (since info is a variable
>storing a pointer to a class instance, and info resides on Locale 0).
>The same reasoning applies to 'word', which is a pointer to string data.
>My experiments with the current implementation indicate that the string
>pointer is included in remote value forwarding but that the string data
>will be communicated when you use it.
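As an aside, the 'local field' pragma mentioned above would presumably be spelled something like this on the class declaration. This is only a sketch - pragmas are unstable compiler internals, so the exact spelling and behavior should be double-checked:

```chapel
// Sketch: tell the compiler these fields live on the locale running the code,
// so field accesses need not generate wide-pointer communication checks.
class PartitionInfo {
  pragma "local field"
  var head: Node;
  pragma "local field"
  var count: atomic int;
}
```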
>>       info.head = new Node(word, info.head);
>
>Naturally, this call allocates heap memory, which can be slow... If slow
>allocations are the problem, life usually gets better if you're able to
>use tcmalloc (e.g. build with CHPL_MEM=tcmalloc).
>
>The compiler probably assumes that it could be running on a different
>locale than where 'info' is stored, so I wouldn't be surprised if that
>adds some significant overhead (that I believe you can reduce with
>'local field').
>
>>       info.count.add(1);
>
>Here, the compiler probably doesn't know that this statement will run on
>the same locale where 'info.count' is stored. Of course it does in fact,
>and the 'local field' pragma would probably help. But, assuming that the
>compiler doesn't know that, the call info.count.add(1) will turn into
>something like this:
>
>  var to_add = 1;
>  on info.count {
>    some_atomic_add_intrinsic(local pointer to info.count, to_add);
>  }
>
>That means that your line info.count.add(1) will turn into a function
>call and probably pack and unpack arguments for the on statement. All
>this adds overhead when it's not needed.
>
>In the end, this statement will be a local operation, but I don't
>understand why it's atomic at all. If you were running this in parallel,
>you'd run into problems with race conditions on info.head; so you'd want
>to compare-and-swap there. But right now we can't compare-and-swap class
>instances when building for more than one locale. So instead you'd
>probably fall back on protecting both of these fields with a lock (you'd
>use a sync variable for that in Chapel).
>
>Yes, the code is basically representative of the type of work I would
>like to do, not necessarily correct. I noticed I couldn't do atomic
>class pointers, which resulted in a bunch of ad-hoc code to roughly get
>the right behavior. Since it was all slow, I didn't go for correctness
>quite yet.
>
>I would probably use an array for this and just expand the array
>periodically.
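As a sketch of that growable-array idea (untested, names invented; it assumes that resizing a Chapel array's domain preserves the elements at indices common to the old and new domains):

```chapel
// Untested sketch: replace the linked list with a per-partition array
// that doubles its domain when full.
class PartitionInfo {
  var D = {0..#16};
  var words: [D] string;
  var count: int;

  proc append(word: string) {
    if count >= D.size then
      D = {0..#(2*D.size)};    // grow; existing elements are carried over
    words[count] = word;
    count += 1;
  }
}
```

This trades the per-word Node allocation for an occasional bulk reallocation, which should interact much better with the allocator.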
>I'd probably also try to send batches of words in the 'on' statements
>instead of just one, in order to better hide the latency of doing the
>'on' statement.
>
>My first set of major optimization experiments used batching. It didn't
>help. That could be from something else though. Generally agree batching
>would be nice here.
>
>>     }
>>   }
>> }
>>
>> proc main() {
>>   initPartitions();
>
>For anybody not looking also at the github code, the partitions are
>declared like this:
>
>  config var partitionDimensions = 16; //Locales.size;
>  var Partitions: [0..partitionDimensions-1] locale;
>
>That means that they are in an array stored on Locale 0. (Also, the code
>populating this Partitions array is strange and probably non-optimal,
>but that's not what you asked about.)
>
>The reason it's strange is that I found that the order in which the
>hostnames were mapped to locales seemed inconsistent. If you use the
>partition ID and hostname as part of your calculation, e.g. for
>identifying a filename, then you would get different results over time.
>The code there is a quick hack to order by hostname so that, given a set
>of hostnames, the partitions => hostname map is consistent.
>
>Personally, I would probably not use a Partitions array at all and
>instead write a function to go from word hash -> locale like this (e.g.):
>
>  proc localeForWord(word: string): locale {
>    return Locales[ genHashKey32(word) % Locales.size ];
>  }
>
>But, you presumably have some reason to want to possibly have multiple
>partitions on a particular locale. I would probably do that with
>something that can compute the partition/locale in a consistent manner
>without using an array at all. The reason is that the array accesses
>cause communication. An alternative would be to replicate the Partitions
>array on all locales once it's set up.
>One way to do that would be to use the ReplicatedDist() distribution,
>which you can see shown in
>
>  test/{release/}examples/primers/distributions.chpl
>
>This makes sense in terms of communication. I think the reason I created
>the partitions array was to have a place to hold the PartitionInfo. It
>wasn't immediately obvious when I started this how each locale could
>have its own PartitionInfo without a top-level reference. Now that I'm
>more familiar with Chapel, I think this should be more straightforward -
>and worthwhile, since you say there's a big cost.
>
>In any case, in this actual program these effects are limited to the
>program initialization...
>
>>   var wordIndex = new WordIndex();
>
>This allocates heap memory, of course...
>
>>   var t: Timer;
>>   t.start();
>>
>>   var infile = open("words.txt", iomode.r);
>>   var reader = infile.reader();
>>   var word: string;
>>   while (reader.readln(word)) {
>
>As I mentioned, we'd like to provide 'forall file.lines()', which would
>probably be nice here. Also, the current implementation has problems
>with string memory leaks that we're actively working on. The read string
>or a copy of it might be leaked. If you just wanted a serial 'for all
>lines in file', you can do that with
>
>  for word in infile.lines() {
>    wordIndex.indexWord(word);
>  }
>
>file.lines() is demonstrated in
>
>  test/{release/}examples/primers/fileIO.chpl
>
>>     wordIndex.indexWord(word);
>>   }
>>
>>   t.stop();
>>   timing("indexing complete in ", t.elapsed(TimeUnits.microseconds),
>>          " microseconds");
>> }
>
>Generally, it seems like I need to go to a more standard approach (for
>search, at least on Lucene) of allocating large blocks of memory and
>indexing into them. Don't use strings (right now, until the leak is
>fixed). Fix partitioning so that there's no need for a top-level
>reference array - my guess is that this is really bad perf-wise. And
>fix the race conditions around class pointers.
>
>Thanks for the feedback!
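For concreteness, the sync-variable locking suggested earlier in the thread (as the fallback for not being able to compare-and-swap class instances) might look roughly like this untested sketch; the $ suffix is only a naming convention:

```chapel
// Untested sketch of protecting head/count with a sync variable.
class PartitionInfo {
  var lock$: sync bool;     // empty = unlocked; writing acquires, reading releases
  var head: Node;
  var count: int;           // no longer needs to be atomic under the lock

  proc push(word: string) {
    lock$ = true;           // acquire: blocks until the sync var is empty
    head = new Node(word, head);
    count += 1;
    const unused = lock$;   // release: reading empties the sync var
  }
}
```

Writing to a sync variable blocks while it is full and reading blocks while it is empty, which is what makes this pattern work as a mutex.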
>Thanks for trying all of this out,
>
>-michael

_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
