Two other points:

- If you have several input files, write a custom input format that overrides protected boolean isSplitable(JobContext context, Path file) to return false. Each file then goes whole to a single mapper, so the reader never has to start in the middle of a file.
- If the input is not truly massive, you can simply write a piece of code that finds the longest quoted string by reading the entire file. On a single box you can handle tens of gigs per hour.
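For the second point, here is a rough sketch of what such a scan might look like in plain Java, with no Hadoop involved. It assumes quoted strings are delimited by the double-quote character and contain no escaped quotes; if your data escapes quotes (doubled quotes, backslashes), the state machine needs adjusting. The class and method names (LongestQuotedString, longestInText, longestInStream) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: tracks the longest run of characters between two '"' marks.
// Assumption: no escaped quotes inside quoted strings.
final class LongestQuotedString {
    private boolean inQuote = false; // are we currently between an opening and closing quote?
    private long current = 0;        // length of the quoted string being read
    private long longest = 0;        // longest completed quoted string seen so far

    // Feed one character through the two-state machine.
    private void accept(char c) {
        if (c == '"') {
            if (inQuote) {
                longest = Math.max(longest, current); // closing quote: record the length
            }
            inQuote = !inQuote;
            current = 0;
        } else if (inQuote) {
            current++; // newlines inside quotes count like any other character
        }
    }

    // In-memory variant, handy for small inputs and quick checks.
    static long longestInText(CharSequence text) {
        LongestQuotedString scan = new LongestQuotedString();
        for (int i = 0; i < text.length(); i++) {
            scan.accept(text.charAt(i));
        }
        return scan.longest;
    }

    // Streaming variant: reads in fixed-size chunks, so memory use stays constant.
    // A quote still open at end of input is not counted.
    static long longestInStream(Reader reader) throws IOException {
        LongestQuotedString scan = new LongestQuotedString();
        char[] buffer = new char[1 << 16];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                scan.accept(buffer[i]);
            }
        }
        return scan.longest;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LongestQuotedString <file>");
            return;
        }
        // Assumption: the file is UTF-8 text; malformed input raises an exception.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(longestInStream(reader));
        }
    }
}
```

The number it prints, plus some margin, could serve as the read-back limit discussed earlier in the thread, at least for the data that already exists.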
On Wed, Feb 22, 2012 at 3:22 PM, Keith Wiley <kwi...@keithwiley.com> wrote:

> Thanks for responding. Unfortunately, the data already exists. I have no
> way of instituting limitations on the format, much less reformatting it to
> suit my needs. It is true that I can make some general assumptions about
> the data (unrealistically long strings are unlikely to occur), but I can't
> write a steadfastly robust reader under such assumptions.
>
> The problem is that even if I impose an assumption of limited length
> strings, that doesn't prescribe a method for handling the possibility of an
> error. If a string really is too long and the reader fails to detect it,
> I'm not sure how to ensure that the reader or subsequent map task fails in
> a clean fashion.
>
> If I could at least impose an assumption of this sort...and then detect
> and fail cleanly on violations of the assumption, that would go a long way.
>
> I'll think about it.
>
> Thanks.
>
> On Feb 22, 2012, at 14:59 , Steve Lewis wrote:
>
> > It sounds like you may need to give up a little to make things work -
> > Suppose, for example, that you placed a limit on the length of a quoted
> > string, say 1024 characters - the reader can then either start at the
> > beginning or read back by, say 1024 characters to see if the start is in
> > a quote and proceed accordingly - if quoted strings can be of arbitrary
> > length there may be no good solution
>
> ________________________________________________________________________________
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
> -- Galileo Galilei
> ________________________________________________________________________________

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com