On 2017-05-04 12:14 (+0200), Michał J Gajda <mjga...@gmail.com> wrote:
> Interesting. Do You have an error report filed anywhere to peruse? Well, I wrote the below. In the mean time, I've tried replacing "insertWith" with a manual lookup and update. This changed the cases where the segfaults happen, but didn't eliminate them. And I tried using the safe freeze function, to no avail. And I implemented some of the functionality using a Vector, which gives correct results (but without reimplementing a real hashtable, it can only be used for the histogram collection, not k-mer counting, since with large k that requires full 64-bit indexes into the vector) I've also run on several different computers, so I don't think it's a hardware error. -k ---------------------------------------- So: I have a program that indexes k-mers (fixed-size words) as Word64's, storing the different ones along with their counts in a Judy array, and at the end, writing to a file. So an associative map mapping k-word -> count. Then, the second stage re-reads this file, and builds a histogram, tallying how often each count occurs. let's call this count -> frequency For some time, this just seemed to work. But then I implemented an expectation-maximization algorithm to model the frequency distributions, and that failed to terminate in some cases. Going back, I found the following histogram to cause the trouble: ==> SRR901891_1.32.hist <== 865 809 866 775 867 757 868 781 869 725 870 778 871 795 872 774 873 771 48036128 2017612633061982208 Notice that last value? Apparently, there are 2.018e18 different k-mers with a frequency of 48036128. This is clearly not right. Notice also that the latter number is 0x01c000...000. Does this carry any special meaning? One possibility is of course some garbage in the k-mer count index, but I also have a dump command that prints the contents as text. I can find nothing wrong with that, and the highest count k-mers look like this: % sort -nr -k2 *dump* | head tgggggtccagccatggagaagagtttagaca 1670602 gggggtccagccatggagaagagtttagacac 1653146 gtctaaactcttctccatggctggacccccaa 1638158 gaatattatttgggggtccagccatggagaag 1567591 ggaatattatttgggggtccagccatggagaa 1539378 actagtgcttaggaaatctattggaggcagaa 1532596 gcttctgcctccaatagatttcctaagcacta 1531091 aggaatattatttgggggtccagccatggaga 1528806 ctagtgcttaggaaatctattggaggcagaag 1527682 aaaggaatattatttgggggtccagccatgga 1518392 One thing to look into is how the histogram is built. I have wrapped this in a function (mk_judy, code below) that returns a struct of operations, called FreqCount. mk_judy takes as an argument the number of bits to use in the index - but this isn't really used for anything (JudyL is JudyL). Yet, when I changed this parameter from 16 to 32, the last line changed to: 17180896 2017612633061982208 And with 64: 37271776 2017612633061982208 The latter number is always the same, the former ends with 0x8e0, but appears to vary in the most significant bits. Everything else is the same in the histograms, I did a diff on the outputs. -- Code ---------------------------------------- mk_judy :: Int -> IO FreqCount mk_judy l = do j <- J.new :: IO (J.JudyL Int) let ac k = k `seq` J.insertWith (+) k 1 j gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j sc k v = J.insert k v j es = J.unsafeFreeze j >>= J.elems ks = J.unsafeFreeze j >>= J.keys as = J.unsafeFreeze j >>= J.toList return $ FreqCount { key_bits = l , add_count = ac, get_count = gc, set_count = sc , keys = ks, counts = es, assocs = as } -- Code ---------------------------------------- If we're sure there is nothing wrong with the Judy library, the only other thing I can think of is the unsafeFreeze. I'm pretty sure I don't modify the Judy array after accessing it through the "frozen" functions - but could it e.g. be the case that j is frozen with some operations still unevaluated? I tried adding "j `seq`" before unsafeFreeze, and got: 33769696 2017612633061982208 Again, 0x...8e0 and the usual 0x01c0... Anyway, that's where it stands at the moment. I'm not sure what to do, it does worry me that this counting seems to fail on occasion. Note also that the histogram gets truncated (always at the same place, after 873), and it should really go up to 1.5 million. ---------------------------------------- -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell