The annotated tag, v0.9.4 has been created
        at  93a16eba40529d103a35af3c18565a04f35d6a07 (tag)
   tagging  235cf4366ff3b3244a1121b6b5910ca004830f46 (commit)
 tagged by  Jared Simpson
        on  Fri Nov 19 15:12:15 2010 +0000

- Shortlog ------------------------------------------------------------
Tagging v0.9.4

Jared Simpson (603):
      adding new project
      Importing stub files
      Importing configuration files
      Added edge labels to dotty output
      Renamed IVertex to Vertex
      Added vertex merging and removal
      added simplify() function which removes transitive edges from the seqgraph
      Added README
      added function to load edges into the graph
      Initial import of UniEst, Util directories
      First version of UniEst is complete
      In progress check-in of scaffolding code
      Scaffolding in-progress checkin.
      Fixed horrible bug in UniEst and added some better command line parsing
      - Implemented contig uniqueness estimator by overhanging pairs, the 
performance is similar to depth estimation for long (>= 100bp) contigs but 
worse for small
      - UniEst: reworked command line arguments. The align file is now required 
but inference over depth can be disabled with the --no_depth flag. The 
--no_pair flag is removed, it is automatically set when a paired/hist file are 
passed in.
      UniEst: Added graph-based uniqueness inference. It does not work 
particularly well.
      Cleaned up resolve
      Committing test code stub, unit tests will go here
      Added automake files
      Refactored the SeqGraph to be a template.
      More refactoring.
      More refactoring, renamed the SeqGraph module to Bigraph which is more 
general
      Refactored scaffold code to use new templated Bigraph class.
      In-progress checkin of experimental distance estimation code. Lots of 
testing/debug hooks in BDE.cpp
      Initial checkin of development suffixtree and bwt code
      Added suffix tree
      Big refactoring of BWT code
      Refactored code out of BWT class into SuffixArray class
      Bug-fixes
      Merge function fixed. Too slow. In-place construction is needed
      Checking prior to refactor
      Added main program, refactored
      Added overlapper
      Added SeqReader
      Added overlap data structure and HitData class
      Added getopt to index
      Added assemble program and laid out skeleton
      Implemented initial string graph construction algorithm
      Massive refactor
      Added proper destructor to Vertex, fixed memleak in Bigraph::merge
      Added StringGraph classes which implement Myers' formulation of a string 
graph
      Refactored the vertex merge logic to be more intuitive
      Added validation function to stringvertex and vertex
      Simplified overlap representation
      Added generation of reverse suffix array to index
      Updated assemble parameters
      Fixed bug where the orientation of edges that were being merged were 
incorrect
      Rewrote overlap processing logic so all hits for a given read are 
processed at the same time
      Added checks to overlap detection to ensure overlaps are sane
      Moved stringgraph construction functions to SGUtil
      Implemented redundant read detection and removal algorithms
      Renamed SAID to SAElem
      Minor renaming
      Added early exit to SuffixArray::removeReads if the id list to remove is 
empty
      Implemented exact prefix/suffix matching
      Added matching string to oview output
      Fixed bug in oview which was not displaying reverse complement alignments 
correctly
      Renamed BWT::getHits to BWT::getPrefixHits to more accurately reflect its 
purpose
      Fixed bug in SuffixArray::extractPrefixSuffixOverlaps which could output 
multiple hits per read instead of only the optimal hit
      Added verbose guard around prints
      Switched Vertex implementation to use a list instead of an STL map, for 
space reasons and to allow easier sorting for the transitive removal algorithm
      Added edge sorting functions to bigraph and vertex classes.
      Implemented Myers' transitive removal algorithm
      Fixed bug in transred algorithm - twin edges were not being removed
      Fixed bug in TR algorithm. Edges must be marked for removal and removed 
in a single pass afterwards or else some reducible edges may be missed.
      Implemented bucket sort as an improvement over using std::sort across the 
whole suffix array. Should modify this to use histogram sort or another variant 
to drop the memory usage
      Implemented histogram sort
      Refactored SuffixCompare into its own files
      Tweaked parameters
      Removed prints
      Removed some prints
      Swapped order of conditions for terminating loop in histogram sort so 
that valgrind doesn't complain about out of bounds access
      Removed print of contigs before transitive removal/compaction
      Added DNAString class which is a wrapper for a c-string
      Ported Bentley/Sedgewick's multikey quicksort
      Implemented Nong-Zhang-Chan induced copying suffix array construction 
algorithm. In tests it only has to sort 30% of the suffixes using 
MKQS/histogram sort which is a large improvement. The algorithm can probably be 
modified further. The code is in need of a cleanup as well.
      Restored writing the SA out after indexing
      Inlined some frequently called functions to avoid function call overhead
      Refactor SuffixCompare class to distinguish comparing by sequence (in a 
radix sort) and ID
      Added mkqs which was forgotten
      Refactored SA construction code out of index and into class
      Much refactoring
      Moved read/write functions into SuffixArray/BWT classes
      Fixed macro.
      Implemented sampling of the occurrence array to lower memory
      macroed out calls in BWT for readability reasons
      Substantial rewrite of the overlap program
      Deleted unnecessary HitData.cpp
      Rewrote oview to draw all the alignments for a particular read at the 
same time
      Implemented inexact matching to BWT, there is currently no limit on the 
amount of backtracking so it is significantly slower
      Implemented seeded bwt alignment algorithm for inexact suffix/prefix 
matching.
      Portability fixes, added includes and fixed printfs so that it would 
compile on my home machine (32-bit Ubuntu 9.04)
      Slight reworking of the structure of the alignment algorithm in 
bwt_algorithms.cpp
      Fixed oview bug
      Refactored the Overlap struct to include a sub-struct which holds the 
matching coordinates
      Major refactoring to StringGraph. Overhangs are now stored as intervals 
instead of actual strings. The bookkeeping is a bit messy and could probably be 
cleaned up but the checked in version works. It simplified the merging logic 
somewhat.
      Inlining
      Added trimming algorithm, sweepVertex, remove duplicate hits
      Added error correction algorithm to oview, added bubble popping algorithm 
to StringGraph (in progress)
      Fixed bug in duplicate hit removal, hits to BWT and RevBWT must be 
considered differently to avoid stomping IDs
      Added much better, block-wise bwt alignment algorithm
      Added vertex removal program which eliminates vertices that have a high 
error rate.
      Factored Match into its own file
      More changes to coordinate system. Now all the changes of frame happen 
internally to the Match class greatly simplifying client code.
      Progress towards inferring transitive closure edges from consistent 
overlaps. Edges that reveal containments are causing problems.
      Working implementation of transitive closure algorithm.
      Inlined AlphaCount constructor
      Refactored the interval data out of BWTAlign
      First pass at exact assembly/string graph construction algorithm. slow.
      Refactored BWTAlgorithms module, made it a proper namespace and changed 
functions to be more generic
      Added AssembleExact functions and implemented initial version of exact 
string graph algorithm
      Factored out the extension gathering logic for AssembleExact so the same 
function can find left and right extensions
      Working version of exact extension assembly algorithm. Needs cleaning up.
      Added fast method to get the smallest consistent extension for a given 
sequence
      Full implementation of irreducible overlap extraction algorithm. It now 
outputs all irreducible overlaps instead of just the unique one. It will skip 
short substring that are contained within some other string but in general 
substrings in the data set are not handled well. This should be improved.
      Refactored irreducible algorithm into the BWTAlgorithms collection.
      Minor formatting changes
      Added code to pair vertices based on read ids
      propedit
      Added algorithm to output all sequences of length k from a BWT
      Added output code and debug printing. The extraction is (understandably) 
slow for large l.
      Removed unused exact executable
      Removed unused filehandle
      Test commit, no change
      Removed autogenerated files
      Added .gitignore
      Removed *.in files, updated .gitignore
      Updated .gitignore
      Added tools directory with useful scripts
      Added data directory
      Removed "exact" line from sga help text
      Removed comment
      Removed call to basename so code compiles on OSX which does not supply 
GNU basename function and POSIX version is unsuitable.
      First pass at bwt to string graph algorithm.
      Committing changes to work from home.
      Changed command line parameter names, rewrote to use strings as buffers 
instead of arrays. Much less memory required.
      Committing test code using hash_map for transfer to home.
      Added size tracking.
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Removed hash test code as it's not smaller or faster than using std::map. 
The vertex finds are not a bottleneck.
      Reordered members in Edge class and changed GraphColor from an enum to 
uint8_t. The Edge class (and classes deriving from it) are very heavy and push 
the memory usage way up for big assemblies. It might be worth removing the 
start pointer from the Edge class to save 8 bytes.
      Generalized the irreducible overlap algorithm to handle reverse 
complement alignments simultaneously with regular alignments. The code is in 
need of a cleanup/simplification, particularly with how contained reads are 
handled.
      Minor change to assert
      Modified simplify to preferentially merge in the ED_SENSE direction so 
appending to strings is preferred to prepending
      Minor formatting change in a print
      Changed simplify output frequency
      Removed malloc.h include as OSX does not have this header
      Added ability to output RC reads to SE sampler
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      added -o and -m flags to specify output file and the minimum overlap size 
to accept, respectively.
      Enabled -o and -m flags in sga assemble
      Refactored Vertex class to keep edges in a vector instead of a list. Many 
edges must be removed from the vector but the erase() calls are not a 
bottleneck.
      Added functions to ensure that all edges for a vertex in a given 
direction are unique
      Refactored StringEdge/Vertex into its own file.
      Merged StringVertex/StringEdge into Vertex/Edge to simplify code and 
avoid (unused) inheritance overhead.
      Removed m_pStart member from Edge as it can be found from the m_pTwin 
member. This saves 1 pointer.
      Added BitChar class containing a simple bitset of 8 bits
      Updated Vertex::getMemSize to include the size of the string
      Added SimplePool and SimpleAllocator, an implementation of a 
zero-overhead memory pool for objects that do not need to be freed.
      Edge.h: Removed boost memory pool code, passed delete calls to memory 
pool (which do nothing by design)
      Factored Interval/SeqCoord classes out of the Util files
      Moved EdgeDir/EdgeComp definitions to GraphCommon from Util
      Large in-progress refactor of overlap stage. Now instead of outputting 
hits to each element of the suffix array the initial
      Changed overlap hit output mode to ascii for consistency with other 
programs.
      Fixed spelling error in Occurrence class
      Added missing files.
      Cleaned up overlap computation, removed dependency on hits
      Cleanup.
      Adding interval set stub code.
      Implemented the removal of sub-maximal overlap blocks (as produced in 
rare cases by BWTAlgorithms::findOverlapBlocks). The complete list of overlaps 
is sorted to find overlapping blocks which are split apart by 
OverlapBlocks::resolveOverlap. This replaces the IntervalSet idea that was 
never implemented. The algorithm could be slightly improved but the triggering 
case is so rare it isn't worth the extra complexity.
      Refactored OverlapBlock to remove the need for a separate 
OverlapBlockRecord class.
      Formatting changes.
      Created LockedQueue class
      Added destructor to LockedQueue to destroy the mutex.
      Added stub OverlapThread class. Currently not compiled in.
      Implemented some of the threading code, fixed configure/makefiles
      Added warning as a note to self
      Refactored overlap module in preparation for adding threads.
      Output formatting changes
      Moved defaults from parseArgs to declarations.
      Modified GPL boilerplate and added COPYING file to source tree.
      First pass at threaded overlapper. Lock contention currently kills 
performance.
      Implementation of threading that isn't clean but works. Will be cleaned 
up.
      Better threaded code, still in progress.
      Better but more complicated semaphore usage. Current version uses 
multiple buffers but it is probably more complex
      Stable version of threading overlap module. Output is not properly 
processed yet but the thread logic is more or less complete.
      Removed one of the semaphores from OverlapThread as it was redundant
      Completed threading work. Removed LockedQueue class as it is not used in 
the threading module.
      Minor cleanup.
      Committing experimental multi-input buffer threading code to transfer to 
work
      Committing experimental batch-scheduling algorithm for overlap threads.
      Refactored the batch model parallelization algorithm. Currently gives 
better performance
      Experimental paired end resolution code.
      Very experimental code for paired end resolution. Not at all a stable 
version - transferring to sanger to run on the farm to generate numbers.
      Enabled pairedoverlap visit
      More experimental PE code, transferring to work, will be reverted later.
      -Manually unrolled very frequently used loop in AlphaCount
      -Removed redundant calls to get the occurrence counts from the FM-index 
in BWTAlgorithm::updateBothR and updateBothL. Big improvement in speed.
      Refactored functions out of SGAlgorithms into SGPairedAlgorithms
      Refactored some visit algorithms into SGDebugAlgorithms
      Removed some debug code.
      Moved functions from BWTAlgorithms to OverlapAlgorithm
      Heavy refactoring. Moved the inexact overlap code to OverlapAlgorithms. 
Broke the huge, ugly _alignBlock function into more manageable chunks. Still 
needs some cleanup.
      Refactored the multi-alignment printing code from the oview program into 
a class (MultiOverlap)
      Added pileup functionality to multi-overlap.
      Added debug functions for detecting when edges are missed due to base 
calling errors and inexact overlaps.
      Added TransitiveGroup/TransitiveGroupCollection classes and a method to 
Vertex for constructing these.
      Added code to infer matches between different transitive groups
      Refactored out the Alphabet stuff from the SuffixTools dir into its own 
file in Alphabet.
      Refactored some logic from MultiOverlap to Pileup
      First go at base probability calculations.
      Added new quality/probability calculations for overlaps
      First go at performing transitive closure. Very heuristic and a bit hacky 
in places. Notably generated containments aren't handled well.
      More experimental code for detecting missing edges in the graph. Needs 
cleaning up.
      In-progress checkin. More robust algorithm for computing missing edges 
but not perfect yet.
      Fixed bug in missing edge inference where duplicate edges would be 
generated
      Removed some debug prints
      Fixed bug where vertex colors weren't being reset correctly in 
SGRealignVisitor::getMissingCandidates
      In-progress checkin. Added ability to include containment relationships 
in the graph.
      Re-worked seqcoord logic for clarity and to handle containment seqcoords.
      Fixed bug in OverlapAlgorithms where containments would be output many 
times
      Added function to write out the overlaps present in the graph.
      Started refactor to refactor alphabet data structures into their own class
      Re-enabled the realign visitor as the default operation to perform in 
debug mode.
      Added two experimental likelihood maximization functions to MultiOverlap
      Added error correction code that uses the partitions calculated in 
MultiOverlap
      Implemented a new partitioning method based on improving a global 
likelihood
      Added actual read correction and visitor to remove edges that have an 
error rate above a threshold
      Fixed bug in partitionLI and tweaked params
      Added field to output in debug visitor
      Added more experimental partitioning functions.
      Tweaks to partitioning code
      Better partitioning function based on splitting the overlap set via 
discrepant bases
      Tweaks to previous, wrapping up coding for the night and shifting working 
copy to sanger
      Added SeqTrie class
      Removed debug code
      Added functionality to the SeqTrie. Added function to Bigraph::vertex to 
construct it from overlaps.
      Worked on SeqTrie-based error correction. It performs much better than 
the previous partitioning based methods but it is unusable because the memory 
usage explodes because of the insertAtDepth() which cause a combinatorial 
increase in memory use.
      Tweaks to previous.
      Transferring code to home, slight modifications to previous
      Added samQC.py which parses a SAM/BAM file to output some error rate 
metrics. Mostly used to learn Python
      Adding incomplete and non-functional SeqDAVG class to switch to sanger
      Implemented basic functionality of SeqDAVG
      SeqDAVG insert at depth working.
      More work on SeqDAVG
      Checking in exploratory code to work on it from work tomorrow.
      More experimental conflict resolution code
      saca.h/saca.cpp: Changed bucket data from int to int64_t to prevent wrap 
around for very large suffix arrays.
      SeqTrie: removed inefficient insertAtDepth function
      Created SGA/preprocess program which processes read files to remove 
low-quality subsequences and reads with ambiguous bases.
      Removed print statement from preprocess
      More changes to the experimental error correction code. Current method 
can resolve repeats quite well. Must be refactored
      Refactored error correction code into its own namespace in the new 
Algorithms directory
      Very good version of error correction
      Added ability to sub-sample reads to preprocess
      Added missing includes
      Cleaned up some includes, made a stub class for the graph remodeling 
visitor
      Started work on graph remodelling code - added functions to discover the 
complete set of overlaps for a given vertex.
      Added error correction mode to assemble.
      Added output counter to error correct visitor
      Re-wrote Vertex::makeUnique
      Tweaked trimming
      Now checking error codes from pthreads creation routines
      Fixed bug in Match::infer when sequences are not the same length. The 
coordinates must be translated before setting the .seqlen property of the 
SeqCoord or else the isValid() assert will blow because the start/end may be 
out of range
      Added SQG format stub directory/files
      Wrote a file format to hold the assembly graph. It is implemented in the 
SQG/ subdirectory. Modelled after the SAM format.
      Fixed bug in preprocess where sub-sampling was not working properly.
      Added a pe-aware mode to preprocess. PE reads will now be discarded/kept 
together.
      Fixed bug in OverlapAlgorithm introduced during previous refactoring. 
Overlaps were being output for non-terminal right overlapblocks.
      Modified the inexact overlap detection algorithm to remove redundant seeds
      Integrated gzstream wrapper for zlib. Used it in the overlap step for the 
final ASQG output and the temporary hits files.
      Refactored the way ASQG records are output.
      Removed old unused TagValue code in SQG
      Wrote ASQG parser in SGUtil. Now used to read in the graph.
      Fixed parsing bugs, string graph with substring verts now loads and 
builds cleanly.
      Fixed oview to use ASQG input. Removed unused functions.
      Created wrapper for opening a gzip or non-gzip file.
      Created createWriter wrapper in Util to open a gzip or plaintext file 
writer. Used it in SGA/overlap
      Use createWriter/createReader in BWT
      Added new OverlapAlgorithmNew as a re-worked OverlapAlgorithm. This is 
temporary and will be merged soon
      Replaced OverlapAlgorithm with the new, faster seeded algorithm that was 
developed in OverlapAlgorithmNew.
      Added --edge-stats command to assemble which outputs the distribution of 
overlap lengths and number of differences
      Added options to SGA/overlap to explicitly set the seed length and 
stride. These allow for more aggressive seeding (and lower computational time) 
but break the guarantee that all overlaps within epsilon are found. They are 
not used by default and fairly experimental.
      Fixed incorrect timing of collapsing seeds
      Fixed output in transitive reduction to accurately report the number of 
edges and vertices marked.
      Cleaned up OverlapAlgorithm for irreducible overlaps. Preparing to 
implement full, inexact irreducible algorithm.
      Fixed missing include.
      Added SearchHistory classes, transferring code to work
      First pass at inexact irreducible algorithm. Some transitive edges in the 
test set I am using remain but the majority are culled.
      Added function to write out an ASQG from Bigraph
      Fixed fencepost error in SearchHistory compare
      Fixed bug in SearchHistory calculation
      Fixed careless bug in SearchHistory
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Fixed bug in inexact irreducible object. If multiple overlap blocks are 
the same length, some transitive blocks may not get marked.
      Preliminary implementation of contained vertex resolving algorithm. This 
is a debug version and will be changed in a subsequent commit
      Working version of transitive-aware contain algorithm. This algorithm is 
much cleaner than the previous version but the implementation must be cleaned 
up.
      Separated EdgeDesc into its own file.
      Modified EdgeDesc to use a pointer to a vertex instead of a vertex ID
      Rewrote overlap/edge inference algorithms to use EdgeDesc instead of 
Vertices. It is important to track the directionality of the edges as weird 
palindromic
      Returned FUZZ parameter in SGTransRedVisitor to default value of 10.
      Cleaned up interface to enqueueEdges
      Added function to Util to make the floating point comparison between two 
error rates while allowing for a small tolerance.
      Do not allow containment edges in Vertex::getEdges(dir)
      Renamed the SGTransRedVisitor to SGTransitiveReductionVisitor
      Added graph structure validation visitor to find cases where the 
irreducible edges are missing from a vertex or erroneously found.
      Added oview2fa.pl tool
      Changed the order in which vertices are remodelled in the ContainRemove 
visitor to visit the neighbors in order of length. This
      Removed missed print statement
      Rewrite of the findOverlapBlocksInexact algorithm. This is somewhat 
cleaner and a bit faster than the previous method. More cleanup/improvement is 
possible.
      Big improvement to inexact overlap, only branch the search seed after its 
interval is valid to avoid a big unnecessary copy.
      Added ability to randomly change Ns to bases in preprocess so that 
discarding reads can be avoided. It is turned off by default.
      Implemented reference-counted search tree
      Refactored all the search history classes into one file. Added function 
to get the history from a SearchHistoryLink
      Integrated new SearchHistory tracker into the SearchSeeds.
      Re-enabled the list version of the inexact overlap algorithm instead of 
the queue version.
      Removed dead code.
      Tweaked preprocess GC filter.
      Fixed output in overlap align loop
      Added BWTDiskConstruction stub and command line arguments to index
      Implemented the control flow for the bwtdisk algorithm
      Refactored BWT class, moved the reader/writer logic into separate classes 
to allow them to be used by the BWTDisk construction algorithm.
      Removed some more dead code from BWT
      Implemented merging of a bwt in memory with a bwt on disk.
      BWT merging now working, still need to merge the sai and track the 
relative ordering of read ids.
      Changed constant in disk algo
      Minor cleanup.
      Changed the ordering of equal strings from ID comparison to index 
comparison. This makes it far simpler to merge BWTs on disk.
      Cleanup of BWTDiskConstruction code
      More cleanup.
      Added merging of suffix array index to disk construction. Now fully 
functional.
      Factored the Reader/Writer logic out of the SuffixArray class to use it 
in the disk construction.
      Changed constant.
      Added flag to disk construction to build the reverse index. This 
completes the algorithm -
      Re-formatted entire source tree to use spaces instead of tabs.
      Factored the visitor algorithms out of SGAlgorithms into their own file.
      Merged SGAlgorithms::_discoverOverlaps and SGAlgorithms::addOverlapsToSet
      Refactoring.
      Rewrote the remodelAfterExcision function to use the newly developed 
EdgeDescOvermapMap code. It needs refactoring
      Fixed bug introduced to OverlapAlgorithm a few checkins ago. The 
seed_length should be clamped at minOverlap.
      Started to refactor the overlap collection logic out of SGAlgorithms into 
CompleteOverlapSet
      Fixed a bug in CompleteOverlapSet, it now behaves exactly as if all 
overlaps within the parameters were found using the FM-index (as desired). 
Changed the remodel visitor to use it.
      More refactoring, all the overlap discovery algorithms have been moved to 
CompleteOverlapSet.
      Fixed bug where the graph error rate parameter was not being set after 
remodelling.
      Fixed potential memory leak in irreducible algorithm.
      Re-implemented a cleaner version of the inexact irreducible algorithm in 
OverlapAlgorithm
      Started work on handling substrings in irreducible algorithm. 
Unfortunately it seems that we will have to load substrings into the graph and 
then remove them - they can't be deterministically removed at
      Added a default value for the minimum read length
      Wrote core code for resolving the path between the ends of a PE fragment
      Added function to write result of fragment completion algorithm to file
      Added new graph parameters to specify whether the graph has containments 
and/or transitive edges
      Write out containment/transitive tags in Bigraph::writeASQG
      Progress on handling substring vertices.
      More work on substring containments, closer to giving the same results as 
exhaustive algorithm but not perfect.
      Added isContainment property vertex to signal that it needs to be removed 
from the graph instead of setting a color.
      Tweaked setting for sampled reads
      large refactoring, remodelling the graph properly handles generated 
containment and substring edges.
      Cleaned up some code, added new visitor to (trivially) remove identical 
reads.
      Resurrected recursive overlap map construction for debugging long running 
time in yeast case
      Began the implementation of the rmdup subprogram. Refactored 
OverlapAlgorithm so minOverlap is not a member variable but passed into the 
relevant algorithm to run.
      More refactoring.
      Refactored the hit computation code into its own file
      More refactoring and the first working version of rmdup
      Fixed bug in Vertex where the containment flag was not being set in the 
constructor.
      Created files for merge subprogram to merge multiple BWTs.
      Removed test code from read sampler that should not have been checked in.
      Implemented merging of indices from two different read files.
      Added flag to merge reverse indices
      Wrote function to merge two read files together
      Cleaned up outfile naming in SGA/merge, it is now complete in the case of 
merging two indices
      Modified read sampler to add a prefix to each readname
      Implemented new overlap detection algorithm in CompleteOverlapSet
      Started work on 2 bit per base encoded string class
      Implemented the rest of the EncodedString class
      Implemented append and swap functions in EncodedString. Ported the Vertex 
class to use this class to store the sequence.
      Added BWTCodec to encode an alphabet of ACGT$
      Changed the BWT class to use the EncodedString representation of the BWT 
string.
      Added lookup table for shift values and changed value in mask from 
decimal to hex
      Added NoCodec which can be used by EncodedString to avoid doing any 
actual encoding. Useful for testing.
      Changed NoCodec to use a similar get/store function as the real codecs.
      Added 4-bit BWT codec. It uses half the memory compared to not encoding 
the string for roughly the same speed. It is faster than the 3-bit encoder for 
unknown reasons.
      Added option to merge to clean up original files.
      Don't load the reverse read table in SGA/overlap, only use the forward 
read table.
      Changed interface to parseHits so that the reverse read table is not used.
      Changed comment
      Added check to SGA/merge and revised mergeDriver tool
      Reduced the memory usage of Vertex by using the SimplePool allocator and 
removing two data members that are not used currently.
      Major refactoring of how a sequence file is processed in parallel. Wrote 
the generic SequenceProcessFramework to handle reading the file
      Refactored rmdup to use the new concurrency framework. Removed 
OverlapThread which is now obsolete
      Moved some print messages to SequenceProcessFramework
      Generalized the SequenceProcessFramework to take in a SeqReader and an 
optional parameter n which limits the number of
      Removed some prints
      Modified BWTDiskConstruction to use SequenceProcessFramework. This 
involved refactoring the GapArray into its own file.
      Reverted the number of reads per group back to 2M
      Removed double-construction of overlap block
      Fixed a few incorrect forward declares that Clang picked up.
      Implemented new, much faster remodel algorithm. The implementation is not 
perfect yet.
      Fixed last issue with the new remodel algorithm, it now gives the same 
result as the old algorithm but is much much faster.
      Refactored CompleteOverlapSet to use new partitioning code
      Added some void casts so the program compiles without warnings if DNDEBUG 
is specified
      Added skeleton for error correction subprogram
      Added skeleton of ErrorCorrectProcess, implemented control flow for error 
correct subprogram
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      Added methods to SearchHistoryVector and OverlapBlock for extracting the 
string corresponding to a match
      Implemented the rest of the error correction subprogram. It uses the 
simple correction algorithm at the moment but gives good results on simulated 
data.
      Began implementation of run-length encoded BWT class. It reads from a 
.bwt file and compresses the string into runs as it is read.
      Rewrote RLBWT printInfo
      Started to implement the marker placement code
      Implemented setting the markers in the RLBWT and random accessing of 
elements.
      Implemented occurrence counting for RLBWT. Some efficiency gains can 
still be made
      Renamed the old BWT class to SBWT ("simple" BWT). The BWT identifier is 
now a typedef to switch between using the RLE version and the regular version
      Implemented forward-search of the Marker array
      Implemented forward search in getFullOcc as well. The code could be 
cleaned up a bit.
      Fixed bug in RLBWT::initializeFMIndex where the last marker would not be 
placed correctly.
      Added ReadInfoTable to load an index of id,length pairs. This is used to 
construct overlaps from hits in overlap and rmdup. The benefit here
      Added hidden argument to overlap to use exact mode.
      Fixed bug in RemovalAlgorithm where cycles in the graph would cause an 
infinite loop.
      Re-enabled rmdup by writing the id and sequence out to the hits file.
      Force the suffix array to BWT conversion methods to use SBWT for now. 
This should eventually change to writing the RLBWT
      Added hacky BubbleEdge removal visitor. Currently not in use.
      Made the number of reads to process in a batch a parameter to the BWT 
disk construction algorithm
      Added subgraph subprogram, to extract a specified portion of the graph
      Cleaned up subgraph, it now removes containments and properly handles the 
vertex visit logic
      Changed tabs to spaces in samQC, modified so the summary stats can be 
printed in every mode.
      Implemented -o, --outfile option to SGA/correct
      Added option to perform multiple rounds of error correction
      Added quality filtering option to remove reads with a substantial amount 
of low-quality bases.
      Fixed semantics of quality filter
      Fix: the number of times the trim/bubble popping is performed did not 
match the command line parameter
      Removed print that was checked in by error
      Added some extra information to the break writer
      Added small-repeat resolution code. Remove edges that join together two 
sequences with a sub-read length repeat unit if there are
      Added method to MultiOverlap to generate SeqTries.
      Re-enabled seqtrie correction.
      Added quick and dirty PrimerScreen class and enabled the screen in the 
preprocess. This just checks for
      Tweaked PrimerScreen settings. Now matches over the first 14 bases of the 
sequence.
      Added tool scripts to revision control
      Added command line arguments to correct to take in the algorithm to 
use and the conflictCutoff
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Removed print statements from SGRepeatResolveVisitor
      First pass at the scaffold driver. This version is based on bwa but this 
will be replaced by bowtie
      BWA-based distance estimation calculation is complete.
      Added new conversion scripts:
      Added some metrics to the error correction.
      Adding additional development/analysis scripts to revision control
      Removed hardcoded paths from analyzeCorrect, run_bwa.sh and samQC
      Added function to calculate the amount of a read that is covered
      Changed OverlapAlgorithm to remove submaximal overlap blocks for 
containments and proper overlaps at the same time
      Refactored BWTAlgorithms::updateBothL/R to take in the AlphaCount for the 
lower and upper interval.
      Working implementation of binary .bwt file. Uses run-length encoding.
      Forgot to add RLBWT* files
      heavy refactoring of BWT I/O. Now the binary and ascii output files are 
subclasses of IBWTReader/Writer. The rest of
      Fixed a crash in the contain removal algorithm found in the yeast data. 
An assertion would blow if multiple valid overlaps
      Added sparse hash check to configure
      Bumped version number, removed define from SGA.cpp
      Fixed Makefiles/includes so that make dist works
      Removed Tests directory from standard build.
      Initial implementation of in-place removal of strings from the FM-index. 
Not working in this version.
      Working version of the removal of reads from the FM-index in the rmdup 
program.
      Fixed parallel mode for rmdup. Now working as designed.
      Changed output BWTs back to binary
      Fixed rmdup index rebuild. Substring reads were not being written to the 
dup file.
      Added checks to the StringGraph construction and oview functions to 
ensure that each
      Refactored HashMap includes to check for the presence of tr1, 
ext/hash_map, etc.
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      Added ability to load distance estimate edges to the ScaffoldGraph.
      Improved dot output
      Removed unused line of code
      Abstracted out the GapArray functions.
      Implemented 4-bit storage SparseGapArray
      Added arguments to index and merge to control the size of the gap array.
      Fixed order of arguments when creating a ScaffoldEdge.
      Cleaned up MultiOverlap code and removed dead code from other classes
      Added Metrics classes and ability to track statistics about what 
positions in
      Removing Scaffold/Makefile.in which shouldn't have been added to the 
tracking
      Updated the output after error correction
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Fixed help message for merge to indicate it can only take 2 files.
      Added script to compute a-statistic for contigs from a bam file
      Better estimate of expected number of reads per contig by using the 
number of positions in the read that
      Minor update to calculation for expected arrival rate
      Added ability to load a-statistic data from a file to sga-scaffold.
      In-progress checkin of scaffolding code. It compiles but should not be 
used.
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      Fixed terrible bug in scaffolder
      Added ability to write out scaffolds to a file after processing. Still in 
development.
      Fixed bug where an istream* reader was not cleaned up.
      Added script to evaluate the scaffold output
      Added ability to output singletons from the scaffolder
      Added -o,--outfile option to sga-scaffold to specify output file.
      Ported sga-scaffold into the main SGA program as a subcommand
      Added -a,--asqg-outfile option to assemble to write out the final graph.
      evalScaffolds: output number of gaps and mean gap size
      Factored the link data out of the ScaffoldEdge class
      More refactoring. Created the ScaffoldRecord class to hold the output of 
the scaffolding process. It can be read from/written to a file.
      Implemented simple scaffolding output where sequences are truncated if an 
overlap is predicted.
      Modified scaffold to perform reductions until no more reductions can be 
made
      Implemented edit distance calculation for two strings using dynamic 
programming in OverlapTools.
      Refactored dynamic programming algorithm into its own class.
      Finished code to join contigs that are predicted to overlap
      Added perl script to break up a set of scaffolds into contigs
      Added a driver script for the scaffold evaluation.
      Turned off prints in Overlapper
      Fixed bug in scaffold evaluation
      Refactored vertex to vertex search algorithms
      Fixed major performance bug in the irreducible extension algorithm. Every 
right extension was performing 4 branches,
      Implemented scaffold resolution using the string graph.
      Finished graph resolving work. Added command line parameter to choose the 
stringency of the resolution step.
      Made the sequence process framework more generic by using an input 
generator
      Added --no-discard flag to sga correct to suppress discarding reads.
      Fixed int to double conversion warnings.
      Started implementation of local string graph construction. It currently 
generates duplicate edges
      Threaded the mkqs portion of the indexing step. Not a huge decrease in 
running time, around 20%.
      Removed unnecessary print.
      Changed default sample rate for merging to 1024
      Completed connect subprogram.
      Fixed bug in the connect subprogram where the program would abort if the 
first and second reads had identical sequences
      Added experimental bubble-popping "smoothing" algorithm. Not in a state 
that it is usable for production work.
      Added some parameters to merge and correct. Moved the smoothing task in 
assemble to occur before simplification. Smoothing is still experimental.
      Fixed seg fault in search where the m_pWalkIndex member of an SGWalk was 
not initialized in the copy constructor
      Added depth filter to the error correct process to avoid correcting very 
deep sequences, which takes a lot of time.
      Added function to SGSearch to calculate the coverage spanning a given 
edge.
      Made command line parameters for the coverage removal algorithm
      Added sampleRate parameter to rmdup
      Added parameter to the correct subprogram to limit the amount of 
branching for complex reads.
      Fixed bad memory leak in the branch cutoff code for the overlap algorithm
      Started work on kmer-based error correction
      Initial checkin of sga2afg script
      Fixed bugs in the sga2afg and sga2contig scripts
      Added development script for computing an FM-index from a polymorphic 
genome
      Updates on the sga2afg convertor script and the testing graphical 
fm-index python script
      First pass at k-mer error corrector
      Improved kmer-correction. Gives very close results to the overlap 
correction.
      Rewrote vertex/edge allocation logic so that a global memory pool is not 
used. The pool now belongs to the graph that creates the vertex/edge. This 
allows multiple graphs to be created in different threads without stomping over 
each other's memory. The global new for Vertex/Edges is disabled, the 
allocations must go through a pool.
      Re-enabled threading in the connect process since the memory pool issues 
are fixed.
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Changed the read discarding logic for the kmer error corrector
      Fixed bug in error corrector where no sequence would be output for 
uncorrected reads in the kmer algorithm.
      Refactored repository to not contain data files and tools/analysis 
scripts. These are moved to the sgatools repo
      Updated the README
      Removed dead code from repository
      More dead code removal
      More README updates
      Implemented hybrid mode error correction which first performs a kmer 
correction pass, then overlap correction.
      Made the irreducible-edge only algorithm the default for sga overlap. All 
overlaps can be generated using the -x/--exhaustive option.
      Bumped version to 0.92
      Added assert to ScaffoldRecord::introduceGap to catch case where the 
expected overlap between scaffold components is not sane.
      Minor changes to the README
      Merge branch 'gh-pages' of github.com:jts/sga
      Rewrote sga main webpage.
      Obscured email address
      Fixed formatting
      Added bin directory with first version of sga-pipeline script
      Added pipeline script information to the README
      Corrected file extension handling in sga-pipeline
      Rewrote sga-pipeline to be more modular and flexible
      add rmdup-pe workflow to sga-pipeline to remove duplicated paired-end 
reads
      Added logging to the sga-pipeline script
      sga-pipeline: fixed formatting issue for rmdup and correct wrappers
      Modified SeqReader to read compressed fasta/fastq files
      Modified SeqReader to automatically uppercase all input sequences.
      Extended --permuteN option in preprocess to handle the full IUPAC 
ambiguity code set as suggested by Shaun Jackman.
      Cleaned up help message for many subprograms, mostly by adding default 
parameters.
      Changed version numbering to a conventional x.x.x scheme and bumped 
version to v0.9.3
      Updated README with new name of the --trim option
      Added sga connect workflow to sga-pipeline
      Added --skip-preprocess option to sga-pipeline
      Added --version option to sga main program
      Implemented sga qc subprogram. This program looks for, and discards, 
problematic reads. Right now, the qc check requires each read to have a tiling 
of high confidence k-mers (with a short kmer length).
      Added new output file to sga-connect to record the pe reads that could 
not be connected
      Added new subprogram sga-stats which prints out a histogram of the kmer 
counts for a read set.
      Implemented gmap subprogram which is a very basic read-read mapper.
      Added flag to gmap output to indicate reverse complement alignments
      Rewrote sga-connect to work from the graph instead of the FM-index.
      Update the new sga-connect program to mark vertices in the graph that are 
covered by a pe-walk
      Rewrote Util/HashMap.h logic to explicitly define the StringHasher 
function. This is to fix a problem where tr1::unordered_map was available but 
the sparsehash was still trying to use __gnu_cxx::hash<std::string> which does 
not exist.
      Implemented edge link update function in scaffold module
      Cleaned up output in bigraph and assemble.
      Added sga-align and sga-deinterleave helper scripts
      Added new statistics to sga-stats. Now outputs the estimated error rate 
in the reads and the mean overlap depth.
      Rewrote portions of the MultiOverlap correction code for efficiency
      Added structural variation detection options to sga-connect
      Fixed bug in the bubble popper. The counter would never be incremented so 
it would always be reported that no bubbles were popped.
      Fixed string initialization error spotted by valgrind
      Added --with-hoard=PATH option to configure to allow the use of the Hoard 
memory allocator.
      Minor formatting change in configure
      Added --run-lengths parameter to sga-stats to print the run length 
distribution of the BWT
      Fixed typo in README spotted by Matthias Haimel. Added instruction for 
running autogen.sh
      Added a numReads field to the header of the sga-connect output
      Rewrote AlphaCount class to take in a template parameter indicating the 
storage size. Replaced all existing uses of AlphaCount in the code with 
AlphaCount64, the 64-bit storage version.
      Complete re-write of how the BWT occurrence array markers are represented.
      Removed old marker code and cleaned up.
      Fixed error in SmallMarker - was using size_t to hold the unitCount when 
it will be at most 128. Changed to uint8_t, which gives a huge memory saving.
      Cleaned up two-tier code.
      More clean up of two-tier code.
      Removed unused print statements in getInterpolatedMarkers
      Removed gcc force-inline attributes
      Implemented second version of two-tier occurrence array markers.
      Fixed bug in two-tier implementation where the count for the last 
SmallBlock placed was incorrect.
      Changed default sample rate for merging bwts
      Added a method to read in the non-RLE BWT from a binary bwt file.
      Updated version to 0.9.4. The main difference in this version is an 
improved strategy for managing the Occurrence array in the BWT, which requires 
substantially less memory.

jts (1):
      github generated gh-pages branch

-----------------------------------------------------------------------

-- 
Debian packaging for sga
