Paul,

Thanks for raising a lot of interesting points, which I'll attempt to address inline:
On 01/04/2014 16:35, "Paul Houle" <[email protected]> wrote:

>I've been looking at this and here is what I think.
>
>This looks better than what I am using now in Infovore (I might even
>try snapping it in) but probably isn't as good, performance-wise, as
>what I'd like to have.

I did say they were experimental; they certainly aren't production ready
yet. They have a bunch of known bugs, don't necessarily scale
particularly well for some cases and don't cover every possible use
case. Once this code reaches the Jena code base then contributions will
of course be welcome, though I'd prefer not to start getting into that
right now because it complicates the IP Clearance process.

>Something I learned back in grad school is that if you have a matrix
>of 100 million floating point numbers you can do a lot of flops on
>them in the time it takes to turn them into base ten and back,
>especially if you're using Java and you have to puff up ASCII to
>UTF-16 and back.

NodeWritable already takes advantage of Hadoop's WritableComparator
features so that many comparisons are done purely on the binary
serialised form of the Node, without ever having to marshal back to an
actual Jena Node. This means the sort and group parts of reduction can
be done very efficiently. Having similar functionality for the other
primitive types is on the list of proposed future work, but the internal
POC that these libraries originated from was culled before I ever got
that far.

>A lot of the jobs I am interested in doing involve a lot of sums and
>counts so storing numbers in a native representation will make a big
>difference.

There is nothing stopping you using the existing Hadoop native types
like IntWritable here, or extending/decorating the input formats (or
adding a mapper) which converts into native representations as a first
step in your job pipeline.
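The binary-comparison idea above can be sketched without any Hadoop dependency: compare two serialised values byte for byte rather than deserialising them first, which is the trick WritableComparator enables. This is a toy illustration, not the actual NodeWritable implementation; the serialisation scheme here (plain UTF-8 bytes) is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

public class RawCompareSketch {
    // Serialise a string as UTF-8 bytes; in Hadoop this would be the
    // Writable's serialised form sitting in a byte buffer.
    static byte[] serialise(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Compare two serialised values lexicographically, byte by byte,
    // without ever reconstructing the original objects. Byte-wise
    // comparison of UTF-8 agrees with Unicode code point order, so
    // sort/group can run entirely on the raw bytes.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] x = serialise("http://example.org/a");
        byte[] y = serialise("http://example.org/b");
        System.out.println(compareBytes(x, y) < 0);  // true
        System.out.println(compareBytes(x, x) == 0); // true
    }
}
```

A real Hadoop WritableComparator would additionally skip over any length prefix in the serialised form, but the principle is the same.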
>Similarly there are a lot of situations where you can
>treat a IRI as completely opaque (you read it in and write it out
>without ever looking at it, or if you do look at it you are equality
>testing it or pattern matching) In cases like that there is a lot to
>gain from making as few memory allocations copies as possible, so I
>don't want something that "wraps" a Jena Node but I would definitely
>like to get the Jena Node (maybe even lazy evaled) if I want it.

Short term, the improvement would be to evaluate into Nodes more lazily,
which can likely be done easily. Longer term there is definitely scope
for having an alternative binary representation of Nodes based on
something like TDB's inlining approach, where native values would be
stored directly in a byte sequence and Nodes created lazily as and when
needed.

>IRI representation is also an interesting question. I think almost
>all triple stores keep a dictionary of "namespaces" (which just might
>be observed prefixes) and represent IRIs as a namespace pointer plus a
>suffix. Gzip and other compression algorithms eat some of the
>overhead but you get even better results if you compress prefixes
>first and then gzip.

Again, nothing stops you configuring Hadoop to compress the map and/or
reduce phase outputs. GZip will work perfectly nicely on the binary
representations used currently. Doing dictionary encoding in a
distributed environment is difficult because it implies either some
level of centralised coordination or a good hashing scheme. These are
entirely possible but obviously require a lot more design and
development effort.

>I did some work two years ago where I was using Pig to do
>pagerank-like computations on my home cluster and found I could get
>20x speedups by taking the cumulative probability distribution of the
>IRIs and representing them with variable length codes.
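The inlining approach mentioned above can be illustrated in miniature: pack a small native value directly into a 64-bit id under a type tag, so recovering the value needs no dictionary lookup and no Node object at all. The tag layout below is invented for the example; TDB's real NodeId encoding differs in detail.

```java
public class InlineSketch {
    // Toy value inlining: a one-byte type tag in the high bits of a
    // long, with the value itself in the low bits. Tag values here are
    // illustrative only, not TDB's actual encoding.
    static final long TAG_INT = 0x01L << 56;
    static final long VALUE_MASK = (1L << 56) - 1;

    static long inlineInt(int v) {
        // Store the (sign-extended) int in the low bits under the tag.
        return TAG_INT | (((long) v) & VALUE_MASK);
    }

    static boolean isInlineInt(long id) {
        // The high byte identifies the inlined type.
        return (id & ~VALUE_MASK) == TAG_INT;
    }

    static int decodeInt(long id) {
        // Truncating to the low 32 bits recovers the original int,
        // including negative values.
        return (int) (id & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long id = inlineInt(42);
        System.out.println(isInlineInt(id)); // true
        System.out.println(decodeInt(id));   // 42
    }
}
```

Applied to a serialised triple, ids like this would sit directly in the byte sequence, with full Jena Nodes materialised lazily only for values that could not be inlined.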
>In some data
>sets I see that the "a" predicate is 10% of the predicates, for
>instance, and in a case like that you would ideally code it with 4
>bits or so. It becomes a big pain if you want to look at the IRIs
>(gotta join) but often you don't need to look at the IRIs. (Since
>then I also discovered a scalable and stable algorithm for the
>cumulative probability distribution)

Anything is possible long term, but it requires someone to implement it.

>At this point splitability is not a big concern for me, largely
>because I am using multiple moderate size files gzip compressed most
>of the time. I'm thinking a lot about how to compress the raw data
>for this
>
>http://basekb.com/subjectiveEye/
>
>partially because it costs a few hundred a month to store it in AWS,
>but also because it costs a few hundred dollars to do a full scan of
>the data. It turns out the keys that come from Wikipedia are in
>sequential order and Hadoop doesn't mess that up if it isn't
>splitting, and that can be taken advantage of to compress the data.
>
>Another requirement for this kind of thing is metadata about data
>sets. For instance, things that have gone through the reducer
>usually have had sorting and grouping done on them and if you have two
>data sets that have been sorted and grouped the same way you can merge
>them together with a priority queue instead of doing a reduce side
>join and that saves you the cost of doing the reduce. Today this is
>an optimization on the low-level coding I'm doing, but a piggish kind
>of system might change the query plan if it knew how the data is
>organized. (Or maybe knew that multiple copies of the data set
>existed packaged different ways) Similarly, namespace prefixes,
>information about encoding choices, types and such can all go in that
>metadata.

This sounds like application-specific information to me.
The aim of these libraries was not to solve every possible problem but
rather to give you a set of building blocks that allow you to implement
your application and not have to worry about plumbing so much.

>I look at slides 21 and 25 and see a lot of choices I don't like.

The choices made are all about trade-offs; depending on your application
and how you want to process data, different approaches will be viable.
With the size of the datasets we often work with, horizontal scalability
was a major requirement, so the ability to split inputs was vital as
otherwise you aren't making use of your available compute resources
efficiently. There is also the factor that many formats physically can't
be processed on a line-by-line basis because you need access to all the
metadata that might have occurred prior to the current line e.g.
namespace prefixes, base URI etc.

>From a reliability point of view, I find only the line-based
>processing acceptable. I can't accept data loss because of the syntax
>errors that are endemic in "in the wild" RDF data sets. I can't
>accept the possibility of memory blow-outs, in fact the whole reason
>I am using Hadoop is because my last system used to run out of memory.

Again, this is why we explored different approaches; different users
will have different trade-offs they are willing to make.

>On the other hand I am very aware of how much it costs to create a
>new RIOT parser for each line, and that's the reason why it takes 25
>machine*hours to process :BaseKB.

One of our developers raised the idea that for some formats you could
start with the whole-file approach and then fall back to block-based and
then line-based processing as and when you encounter errors. It's
entirely feasible to do this, it just requires a bunch of development
effort, and this idea was only floated at the conclusion of the POC when
the internal project had already been shuttered.
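The fallback idea sketched above is essentially "try the fast strategy, retreat to the tolerant one on error". A generic version might look like the following; the strategies here are invented stand-ins for whole-file and line-based parsing, not real Jena/RIOT APIs.

```java
import java.util.List;
import java.util.function.Function;

public class FallbackParseSketch {
    // Each strategy either returns a count of triples parsed or throws.
    // Real code would wrap RIOT parsers; strategies are ordered from
    // most efficient (whole file) to most tolerant (line based).
    static int parseWithFallback(String input,
                                 List<Function<String, Integer>> strategies) {
        RuntimeException last = null;
        for (Function<String, Integer> strategy : strategies) {
            try {
                return strategy.apply(input);
            } catch (RuntimeException e) {
                last = e; // fall back to the next, more tolerant strategy
            }
        }
        throw last != null ? last : new IllegalStateException("no strategies");
    }

    public static void main(String[] args) {
        // Toy strategies: "whole file" rejects any input containing a bad
        // line, "line based" skips bad lines and counts the good ones.
        Function<String, Integer> wholeFile = s -> {
            if (s.contains("BAD")) throw new RuntimeException("syntax error");
            return s.split("\n").length;
        };
        Function<String, Integer> lineBased = s -> {
            int ok = 0;
            for (String line : s.split("\n")) {
                if (!line.contains("BAD")) ok++;
            }
            return ok;
        };
        String data = "<s> <p> <o> .\nBAD LINE\n<s> <p> <o2> .";
        System.out.println(parseWithFallback(data, List.of(wholeFile, lineBased)));
        // whole-file parse fails on the bad line; line-based recovers 2 triples
    }
}
```

The cost trade-off Paul raises still applies: the line-based fallback pays for a fresh parse per line, so you only want to reach it when the cheaper strategies have actually failed.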
>I think some of these
>implementations involve passing data between threads; perhaps you can
>get the cost of that down by batching, but thread switching and
>coordination is expensive.
>
>For me at this point, multiple outputs are necessary because I'm
>using them in my sieve3 stage. (The need to recode this for API
>changes is the reason I haven't switched to Hadoop 2)

In which case these libraries won't help you in the short term because
they are all built for MRv2 against Hadoop 2.2.0.

>Another thing I am thinking about is that the Map/Reduce API isn't a
>perfect match for most workloads and that the real future is
>
>http://tez.incubator.apache.org/
>
>Tez is a better paradigm for jobs that have multiple ins and outs, so
>it might make sense to just skip the Rube Goldberg machine that is
>used for multiple outs in the MR API and just do things the Tez way.

Maybe, but then there are a bunch of other Hadoop-o-sphere projects all
claiming to do similar things. Certainly the computation model is moving
beyond just Map/Reduce, but it's hard to know which horse to back. I've
no experience of Tez, and the fact that its website tells me nothing of
how to implement a Tez workflow isn't entirely encouraging from my POV.
There are various other interesting projects like Pig, Spark etc. which
might also be targets for integration longer term. Ultimately though,
everything requires someone to contribute it, and the community of
people doing RDF/Linked Data/Semantic Web is minute compared to the
community of people doing Hadoop, and there isn't necessarily much
overlap. And unfortunately the few of us like ourselves doing both are
already heavily over-subscribed.

Rob

>
>On Tue, Apr 1, 2014 at 9:09 AM, Andy Seaborne <[email protected]> wrote:
>> On 01/04/14 12:11, Rob Vesse wrote:
>>> Ok, I think probably what is best is if I roll up the relevant stuff
>>> into patches on the JIRA so you can apply them yourself.
>>> They'll be one for the incubator website and one for Jena
>>>
>>> I can probably also add my Powerpoint slides to the JIRA issue which
>>> summarise a lot of what is (and isn't) there and I'll send a write up
>>> to the list at some point this week.
>>
>> Great.
>>
>> Andy
>>
>>>
>>> Rob
>>>
>>> On 01/04/2014 11:19, "Andy Seaborne" <[email protected]> wrote:
>>>
>>>> On 01/04/14 10:25, Rob Vesse wrote:
>>>>> Andy
>>>>>
>>>>> I have now got all the necessary legal and management clearance from
>>>>> Cray to move ahead with donating the experimental Hadoop RDF work
>>>>> that we've previously discussed privately to the Jena project.
>>>>
>>>> I know about it but not what it is. I'm very interested in seeing it
>>>> ...
>>>>
>>>>> So I am ready to move forwards with the IP Clearance process but
>>>>> I've run into a slight snag in that it appears to require an ASF
>>>>> Member/Officer to actually formally be responsible for executing
>>>>> the process.
>>>>>
>>>>> I think in reality this just means that an ASF member has to check
>>>>> that all the boxes have been appropriately ticked and run the
>>>>> necessary votes once we reach that stage of the process. As you are
>>>>> an ASF member would you be willing to do this?
>>>>>
>>>>> I am happy to do all the leg work, I already have the IP clearance
>>>>> document ready to commit to the Incubator website and the initial
>>>>> code base ready to commit to the Experimental area. Once these
>>>>> initial things are committed I can start working through the
>>>>> remaining steps such as changing the Copyright Headers, putting
>>>>> together the NOTICE and LICENSE files appropriately etc. I will
>>>>> also go ahead and file a JIRA for tracking purposes.
>>>>>
>>>>> Before I start on this I wanted to check that you are OK with
>>>>> acting as the responsible person for this?
>>>>
>>>> /me off to read up on the process ...
>>>>
>>>> Yes, fine.
>>>> I have incubator karma to update the IP pages.
>>>>
>>>> Full steam ahead!
>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Rob
>>>>
>>>> In terms of the destination being the Jena project, this looks like
>>>> it is fully aligned to the project charter:
>>>>
>>>> """
>>>> the creation and maintenance of
>>>> open-source software related to accessing, storing, querying,
>>>> publishing and reasoning with semantic web data while
>>>> adhering to relevant W3C and community standards
>>>> for distribution at no charge to the public.
>>>> """
>>>>
>>>> and I (personally) am quite relaxed about any variety if there is
>>>> some connection to Jena even if it's quite small (e.g. there is at
>>>> least one import statement like "org.apache.jena....."!). We can't,
>>>> in all honesty, adopt random other unrelated software without at
>>>> least checking with other linked data projects in ASF. But at the
>>>> same time, linked data @ASF isn't the Hadoop-infrastructure
>>>> ecosystem where each piece can sustain its own TLP.
>>>>
>>>> Andy
>>>
>>
>
>--
>Paul Houle
>Expert on Freebase, DBpedia, Hadoop and RDF
>(607) 539 6254    paul.houle on Skype    [email protected]
