Re: Review Request: GIRAPH-277: Text Vertex Input/Output Format base classes overhaul

Eli Reisman Thu, 09 Aug 2012 18:01:11 -0700


> On Aug. 10, 2012, 12:47 a.m., Eli Reisman wrote:
> > Hey Jaeho, great work here. This was a lot and could not have been easy to 
> > wade through. I love your attention to detail and fantastic comments that 
> > can lead the way for others to figure out how to put together IO formats. 
> > The small things I noticed were:
> > 
> > - in AdjacencyListVertexOutputFormat, maybe call writeLine() writeVertex() 
> > like it was, since it takes a vertex and is writing at the unit of one 
> > vertex per line. The other formats consume input to write at this same 
> > granularity and it keeps the processing units consistent throughout the 
> > code. Its just a detail that we are writing to text lines in this family of 
> > OutputFormats, many folks here will be using other formats for IO that do 
> > not involve text (including myself most of the time.) You might rename the 
> > enclosing class to indicate lines and not just text or vertices are being 
> > written out here? Or just add some nice javadoc comments to make it clear 
> > to users what is going to happen here.
> > 
> > - in TextVertexInputFormat some comments mix web-escaped sequences of 
> > greater-than and less-than chars with the real angle-brackets.
> > 
> > - in VertexReaderFromEachLine, the getId(), getValue(), and getEdge() are a 
> > clean solution, but in their Text input they force the user to reparse the 
> > same line over and over. One pass for parsing is nice and easier not to 
> > mess up so I'm not sure this saved the users any time. In the version that 
> > is preprocessed that takes a generic type, you could pass a list of tokens, 
> > but then the user must be sure the number of tokens read is consistent with 
> > what they found on the line or those methods will trusting the sequence in 
> > the token list. This is not bad, but again it doesn't seem any easier for a 
> > user to debug and for the parsing of single lines when the user is aware of 
> > the assumed formatting and can work with tokens that don't make sense to 
> > the parser "in context" of the line. consuming each line in one place as a 
> > stream is familiar to programmers in general, and they can break the work 
> > into methods as they see fit. The place where this makes the code cleaner 
> > is the vertex creation piece that puts all the parsed values together, but 
> > you've already written that for them, so it doesn't really give the user 
> > less code to write. In the end, many users will and do use non-text IO so I 
> > think just having access to the line information to parse might be enough 
> > here.
> > 
> > In some cases, while reading the old and new code in the diff, I got the 
> > feeling a mix of both would be the best and tightest solution, mixed with a 
> > lot better commenting of the sort you put into you new code to light the 
> > way for them.  How can we tweak this code to focus on making the unfamiliar 
> > Giraph/Hadoop boilerplate go away without adding too many components for a 
> > new user to remember to compose and wire up? I think the solution that 
> > would make me do backflips would be the one that solves that puzzle, and I 
> > feel like you're already most of the way there with this patch. 
> > 
> > I would be curious, have you run this code while mixing and matching these 
> > IO formats and some vertices, and perhaps mixing in the wrong one sometimes 
> > to see if it breaks as it should? I would like to verify that shifting some 
> > of the generic syntax around does not confuse ReflectionUtils at any point 
> > in the game, this is why we have so many generic defs all over the code 
> > right now, and its delicate. the utils have to successfully walk the 
> > inheritance chain all the way to where the parameters are concretely set 
> > without missing a link in the chain or they loose the type info. I think 
> > we're probably in the clear here, but I'd like to try a dry run just to see 
> > what happens if you haven't already.
> > 
> > On the whole, I like this. It adds more clarity than it takes away. I think 
> > we could go a step simpler here and it would be a win, but again thats just 
> > me, and with this in place maybe the additional simplification would be the 
> > work of someone else in another JIRA.
> >


One note here, in the VertexReaderFromEachLine bit, I think my last point is 
this: we get to say createVertex( getId(line), getValue(line), getEdge(line) .. 
) etc but you do that for them, and at the cost of them implementing 3 new 
methods and accepting that the tokens they receive are in context, and in the 
order they expect. The alternative of just parsing their own line seems like a 
familiar part of the puzzle for most programmers. The parts they need help with 
are probably navigating the maze of less familiar Giraph/Hadoop objects to 
manipulate and implement on the way to a finished IO format. Anything to make 
that checklist shorter would be a big win.


- Eli


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6402/#review10101
-----------------------------------------------------------


On Aug. 6, 2012, 9:21 p.m., Jaeho Shin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6402/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 9:21 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> (copied from GIRAPH-277)
> 
> The current way of implementing VertexInputFormat and VertexReader had bad 
> smell. It required users to understand how these two classes are glued 
> together, and forced similar codes to be duplicated in every new input 
> format. (Similarly for the VertexOutputFormat and VertexWriter.) Anyone who 
> wants to create a new format should create an underlying record reader or 
> writer at the right moment and delegate some calls to it, which seemed 
> unnecessary detail being exposed. Besides, type parameters had to appear all 
> over every new format code, which was extremely annoying for both reading 
> existing code and writing a new one. I was very frustrated writing my first 
> format code especially when I compared it to writing a new vertex code. I 
> thought writing a new input/output format should be as simple as vertex.
> So, I have refactored TextVertexInputFormat and OutputFormat into new forms 
> that have no difference in their interfaces, but remove a lot of burden for 
> subclassing. Instead of providing static VertexReader base classes, I made it 
> a non-static inner-class of its format class, which helps eliminate the 
> repeated code for gluing these two, already tightly coupled classes. This has 
> additional advantage of eliminating all the Generics type variables on the 
> VertexReader side, which makes overall code much more concise. I added 
> several useful TextVertexReader base classes that can save efforts for 
> implementing line-oriented formats.
> 
> 
> This addresses bug GIRAPH-277.
>     https://issues.apache.org/jira/browse/GIRAPH-277
> 
> 
> Diffs
> -----
> 
>   trunk/checkstyle.xml 1369914 
>   
> trunk/src/main/java/org/apache/giraph/examples/IntIntNullIntTextInputFormat.java
>  1369914 
>   trunk/src/main/java/org/apache/giraph/examples/SimplePageRankVertex.java 
> 1369914 
>   trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 
> 1369914 
>   
> trunk/src/main/java/org/apache/giraph/examples/SimpleTextVertexOutputFormat.java
>  1369914 
>   
> trunk/src/main/java/org/apache/giraph/examples/VertexWithComponentTextOutputFormat.java
>  1369914 
>   trunk/src/main/java/org/apache/giraph/graph/VertexWriter.java 1369914 
>   
> trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexInputFormat.java
>  PRE-CREATION 
>   
> trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexOutputFormat.java
>  1369914 
>   trunk/src/main/java/org/apache/giraph/lib/AdjacencyListVertexReader.java 
> 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/IdWithValueTextOutputFormat.java 
> 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexInputFormat.java 
> 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexOutputFormat.java 
> 1369914 
>   
> trunk/src/main/java/org/apache/giraph/lib/JsonLongDoubleFloatDoubleVertexInputFormat.java
>  1369914 
>   
> trunk/src/main/java/org/apache/giraph/lib/JsonLongDoubleFloatDoubleVertexOutputFormat.java
>  1369914 
>   
> trunk/src/main/java/org/apache/giraph/lib/LongDoubleDoubleAdjacencyListVertexInputFormat.java
>  1369914 
>   
> trunk/src/main/java/org/apache/giraph/lib/TextDoubleDoubleAdjacencyListVertexInputFormat.java
>  1369914 
>   trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java 
> 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/TextVertexOutputFormat.java 
> 1369914 
>   
> trunk/src/test/java/org/apache/giraph/lib/TestAdjacencyListTextVertexOutputFormat.java
>  1369914 
>   
> trunk/src/test/java/org/apache/giraph/lib/TestIdWithValueTextOutputFormat.java
>  1369914 
>   
> trunk/src/test/java/org/apache/giraph/lib/TestLongDoubleDoubleAdjacencyListVertexInputFormat.java
>  1369914 
>   
> trunk/src/test/java/org/apache/giraph/lib/TestTextDoubleDoubleAdjacencyListVertexInputFormat.java
>  1369914 
> 
> Diff: https://reviews.apache.org/r/6402/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jaeho Shin
> 
>

Re: Review Request: GIRAPH-277: Text Vertex Input/Output Format base classes overhaul

Reply via email to