> On Aug. 10, 2012, 12:47 a.m., Eli Reisman wrote: > > Hey Jaeho, great work here. This was a lot and could not have been easy to > > wade through. I love your attention to detail and fantastic comments that > > can lead the way for others to figure out how to put together IO formats. > > The small things I noticed were: > > > > - in AdjacencyListVertexOutputFormat, maybe call writeLine() writeVertex() > > like it was, since it takes a vertex and is writing at the unit of one > > vertex per line. The other formats consume input to write at this same > > granularity and it keeps the processing units consistent throughout the > > code. Its just a detail that we are writing to text lines in this family of > > OutputFormats, many folks here will be using other formats for IO that do > > not involve text (including myself most of the time.) You might rename the > > enclosing class to indicate lines and not just text or vertices are being > > written out here? Or just add some nice javadoc comments to make it clear > > to users what is going to happen here. > > > > - in TextVertexInputFormat some comments mix web-escaped sequences of > > greater-than and less-than chars with the real angle-brackets. > > > > - in VertexReaderFromEachLine, the getId(), getValue(), and getEdge() are a > > clean solution, but in their Text input they force the user to reparse the > > same line over and over. One pass for parsing is nice and easier not to > > mess up so I'm not sure this saved the users any time. In the version that > > is preprocessed that takes a generic type, you could pass a list of tokens, > > but then the user must be sure the number of tokens read is consistent with > > what they found on the line or those methods will trusting the sequence in > > the token list. This is not bad, but again it doesn't seem any easier for a > > user to debug and for the parsing of single lines when the user is aware of > > the assumed formatting and can work with tokens that don't make sense to > > the parser "in context" of the line. consuming each line in one place as a > > stream is familiar to programmers in general, and they can break the work > > into methods as they see fit. The place where this makes the code cleaner > > is the vertex creation piece that puts all the parsed values together, but > > you've already written that for them, so it doesn't really give the user > > less code to write. In the end, many users will and do use non-text IO so I > > think just having access to the line information to parse might be enough > > here. > > > > In some cases, while reading the old and new code in the diff, I got the > > feeling a mix of both would be the best and tightest solution, mixed with a > > lot better commenting of the sort you put into you new code to light the > > way for them. How can we tweak this code to focus on making the unfamiliar > > Giraph/Hadoop boilerplate go away without adding too many components for a > > new user to remember to compose and wire up? I think the solution that > > would make me do backflips would be the one that solves that puzzle, and I > > feel like you're already most of the way there with this patch. > > > > I would be curious, have you run this code while mixing and matching these > > IO formats and some vertices, and perhaps mixing in the wrong one sometimes > > to see if it breaks as it should? I would like to verify that shifting some > > of the generic syntax around does not confuse ReflectionUtils at any point > > in the game, this is why we have so many generic defs all over the code > > right now, and its delicate. the utils have to successfully walk the > > inheritance chain all the way to where the parameters are concretely set > > without missing a link in the chain or they loose the type info. I think > > we're probably in the clear here, but I'd like to try a dry run just to see > > what happens if you haven't already. > > > > On the whole, I like this. It adds more clarity than it takes away. I think > > we could go a step simpler here and it would be a win, but again thats just > > me, and with this in place maybe the additional simplification would be the > > work of someone else in another JIRA. > >
One note here, in the VertexReaderFromEachLine bit, I think my last point is this: we get to say createVertex( getId(line), getValue(line), getEdge(line) .. ) etc but you do that for them, and at the cost of them implementing 3 new methods and accepting that the tokens they receive are in context, and in the order they expect. The alternative of just parsing their own line seems like a familiar part of the puzzle for most programmers. The parts they need help with are probably navigating the maze of less familiar Giraph/Hadoop objects to manipulate and implement on the way to a finished IO format. Anything to make that checklist shorter would be a big win. - Eli ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6402/#review10101 ----------------------------------------------------------- On Aug. 6, 2012, 9:21 p.m., Jaeho Shin wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6402/ > ----------------------------------------------------------- > > (Updated Aug. 6, 2012, 9:21 p.m.) > > > Review request for giraph. > > > Description > ------- > > (copied from GIRAPH-277) > > The current way of implementing VertexInputFormat and VertexReader had bad > smell. It required users to understand how these two classes are glued > together, and forced similar codes to be duplicated in every new input > format. (Similarly for the VertexOutputFormat and VertexWriter.) Anyone who > wants to create a new format should create an underlying record reader or > writer at the right moment and delegate some calls to it, which seemed > unnecessary detail being exposed. Besides, type parameters had to appear all > over every new format code, which was extremely annoying for both reading > existing code and writing a new one. I was very frustrated writing my first > format code especially when I compared it to writing a new vertex code. I > thought writing a new input/output format should be as simple as vertex. > So, I have refactored TextVertexInputFormat and OutputFormat into new forms > that have no difference in their interfaces, but remove a lot of burden for > subclassing. Instead of providing static VertexReader base classes, I made it > a non-static inner-class of its format class, which helps eliminate the > repeated code for gluing these two, already tightly coupled classes. This has > additional advantage of eliminating all the Generics type variables on the > VertexReader side, which makes overall code much more concise. I added > several useful TextVertexReader base classes that can save efforts for > implementing line-oriented formats. > > > This addresses bug GIRAPH-277. > https://issues.apache.org/jira/browse/GIRAPH-277 > > > Diffs > ----- > > trunk/checkstyle.xml 1369914 > > trunk/src/main/java/org/apache/giraph/examples/IntIntNullIntTextInputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/examples/SimplePageRankVertex.java > 1369914 > trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java > 1369914 > > trunk/src/main/java/org/apache/giraph/examples/SimpleTextVertexOutputFormat.java > 1369914 > > trunk/src/main/java/org/apache/giraph/examples/VertexWithComponentTextOutputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/graph/VertexWriter.java 1369914 > > trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexInputFormat.java > PRE-CREATION > > trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexOutputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/lib/AdjacencyListVertexReader.java > 1369914 > trunk/src/main/java/org/apache/giraph/lib/IdWithValueTextOutputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexInputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexOutputFormat.java > 1369914 > > trunk/src/main/java/org/apache/giraph/lib/JsonLongDoubleFloatDoubleVertexInputFormat.java > 1369914 > > trunk/src/main/java/org/apache/giraph/lib/JsonLongDoubleFloatDoubleVertexOutputFormat.java > 1369914 > > trunk/src/main/java/org/apache/giraph/lib/LongDoubleDoubleAdjacencyListVertexInputFormat.java > 1369914 > > trunk/src/main/java/org/apache/giraph/lib/TextDoubleDoubleAdjacencyListVertexInputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java > 1369914 > trunk/src/main/java/org/apache/giraph/lib/TextVertexOutputFormat.java > 1369914 > > trunk/src/test/java/org/apache/giraph/lib/TestAdjacencyListTextVertexOutputFormat.java > 1369914 > > trunk/src/test/java/org/apache/giraph/lib/TestIdWithValueTextOutputFormat.java > 1369914 > > trunk/src/test/java/org/apache/giraph/lib/TestLongDoubleDoubleAdjacencyListVertexInputFormat.java > 1369914 > > trunk/src/test/java/org/apache/giraph/lib/TestTextDoubleDoubleAdjacencyListVertexInputFormat.java > 1369914 > > Diff: https://reviews.apache.org/r/6402/diff/ > > > Testing > ------- > > > Thanks, > > Jaeho Shin > >
