It's not lossy; that would be a disaster if it were. You specify the compression codec yourself, so you can use whichever codecs are supported, e.g. LZO.
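
Here's roughly what that looks like when writing a SequenceFile directly (an untested sketch against the 0.20-style API; the path, key/value classes and codec are just placeholders, and LZO would need the separate hadoop-lzo codec installed, so I use the built-in DefaultCodec here):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SeqFileCompressionSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block compression with an explicit codec. The codec only changes how
        // the bytes are packed on disk; the values come back exactly as written.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/tmp/example.seq"),
            IntWritable.class, DoubleWritable.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        writer.append(new IntWritable(1), new DoubleWritable(Math.PI));
        writer.close();

        // Compression of intermediate map output is configured separately,
        // e.g. conf.setBoolean("mapred.compress.map.output", true);
      }
    }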
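On Sean's accuracy point below: doubles go through Writable serialization as their raw 8-byte IEEE-754 encoding, so the round trip is bit-exact. A quick illustration (my own throwaway snippet, not Mahout code):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;

    public class DoubleRoundTrip {
      public static void main(String[] args) throws IOException {
        double original = 1.0 / 3.0;  // not representable exactly in decimal, fine in binary64

        // Serialize the value the same way a SequenceFile value is written.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DoubleWritable(original).write(new DataOutputStream(bytes));

        DoubleWritable restored = new DoubleWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        // Compare the raw IEEE-754 bits to show the encoding is exact.
        System.out.println(Double.doubleToLongBits(original)
            == Double.doubleToLongBits(restored.get()));  // prints true
      }
    }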
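And on the log-scaling work you mention below: the usual way to stay in log space without underflow is the log-sum-exp trick. Something like this (just a sketch of the general idea, not necessarily what MAHOUT-627 will do):

    public class LogSumExp {
      // Computes log(sum_i exp(logValues[i])) without leaving log space,
      // so sums over many tiny probabilities don't underflow to zero.
      static double logSumExp(double[] logValues) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logValues) {
          max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
          return max;  // every term is zero probability
        }
        double sum = 0.0;
        for (double v : logValues) {
          sum += Math.exp(v - max);  // each term is at most 1, so no underflow of the total
        }
        return max + Math.log(sum);
      }

      public static void main(String[] args) {
        // Probabilities around 1e-300: summing them in log space stays stable.
        double[] logProbs = { Math.log(1e-300), Math.log(2e-300), Math.log(3e-300) };
        System.out.println(logSumExp(logProbs));  // roughly log(6e-300)
      }
    }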
On Thu, Jul 14, 2011 at 7:40 AM, Dhruv Kumar <[email protected]> wrote:
> On Thu, Jul 14, 2011 at 10:29 AM, Sean Owen <[email protected]> wrote:
> > Serialization itself has no effect on accuracy; doubles are encoded
> > exactly as they are in memory.
> > That's not to say that there may be an accuracy issue in how some
> > computation proceeds, but it is not a function of serialization.
>
> Interesting, are there factors specific to Hadoop (not just subtleties of
> Java or the OS) which can affect accuracy and I should be concerned about?
>
> Also, Sequence File stores compressed key value pairs does it not? Is that
> compression lossy?
>
> > On Thu, Jul 14, 2011 at 2:54 PM, Dhruv Kumar <[email protected]> wrote:
> > > What are the algorithms and codecs used in Hadoop to compress data and
> > > pass it around between mappers and reducers? I'm curious to understand
> > > the effects it has (if any) on double precision values.
> > >
> > > So far my trainer (MAHOUT-627) uses unscaled EM training and I'm soon
> > > starting the work on using log-scaled values for improved accuracy and
> > > minimizing underflow. It will be interesting to compare the accuracy of
> > > the unscaled and log scaled variants so I'm curious.

--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
