Hello,

CD-HIT can remove redundancies from sequence files; the sequences do not need to be aligned.
http://weizhongli-lab.org/cdhit_suite/cgi-bin/index.cgi?cmd=cd-hit
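For the simplest case raised in the quoted message below (exact duplicates and fragments fully contained in a longer sequence), no alignment is needed at all. The following is only a minimal Python sketch of that idea, not CD-HIT's algorithm, which clusters sequences at a chosen identity threshold. The function names `parse_fasta` and `remove_contained` and the example sequences are made up for illustration, and the pairwise substring check is quadratic, so it is probably only practical for modest file sizes.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(lines: Iterable[str]) -> List[Record]:
    """Parse FASTA-formatted lines into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Upper-case so that case differences do not hide duplicates.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_contained(records: List[Record]) -> List[Record]:
    """Drop sequences identical to, or substrings of, an already kept one.

    Sequences are visited longest first, so a fragment is always
    compared against its possible full-length counterpart.
    """
    kept: List[Record] = []
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if any(seq in kept_seq for _, kept_seq in kept):
            continue  # exact duplicate or contained fragment
        kept.append((header, seq))
    return kept


# Hypothetical input: one full-length sequence, a fragment of it,
# an exact duplicate, and an unrelated sequence.
example = """>full
MKTAYIAKQRQISFVKSHFSRQ
>fragment
AKQRQISFVK
>duplicate
MKTAYIAKQRQISFVKSHFSRQ
>other
GSHMLEDPVAGGT
""".splitlines()

for header, seq in remove_contained(parse_fasta(example)):
    print(f">{header}\n{seq}")
```

Anything below 100% identity needs a real clustering tool such as CD-HIT, since a substring test cannot express partial similarity.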
Andreas

>-----Original Message-----
>From: jalview-discuss-boun...@jalview.org [mailto:jalview-discuss-boun...@jalview.org] On Behalf Of Jim Procter
>Sent: 19 October 2016 07:47
>To: jalview-discuss@jalview.org
>Subject: [Jalview-discuss] Redundancy removal for large sets of sequences [was Re: Problems installing and then running Jalview on Windows 10 ]
>
>Hi Kausik.
>
>On 18/10/2016 19:33, Kausik Datta wrote:
>> What I next needed to see is whether this pipeline can handle a FASTA file with similar AND dissimilar sequences in it. Unfortunately, Jalview tried to align all sequences in it, introducing gaps in the middle (naturally) which threw off the redundancy removal process also. I was able to partially remedy this by creating groups of similar sequences, but then I had to do the alignment & redundancy for each group separately.
>
>This does sound like a limitation with the percent-identity measure used, since it calculates the degree of similarity including gapped columns (something that Jalview has done from the beginning). For 100% identity, however, it is actually unlikely to matter, since for any reliable alignment algorithm, sequence fragments will be aligned in the same way as the full length sequences.
>
>Could you give us a little more background? If it is purely about removing 'identical' fragments, then the '100%' removal will work because a subsequence and its full length counterpart will be 100% identical regardless.
>
>> What I have come to realize is that there is probably no single program that can help me do what I am trying to achieve: remove redundancies from a large FASTA file.
>
>You may be correct here - Jalview's redundancy removal function was only designed for use in comparative analysis. There are some standard methods for performing this filtering, of course (it's a common step for any sequencing pipeline) - but again, it depends what you are trying to achieve!
>
>Does anyone else have any suggestions to help Kausik?
>Jim.
>
>--
>-------------------------------------------------------------------
>Dr JB Procter, Jalview Coordinator, The Barton Group
>Division of Computational Biology, School of Life Sciences
>University of Dundee, Dundee DD1 5EH, UK.
>+44 1382 388734 | www.jalview.org | www.compbio.dundee.ac.uk
>
>_______________________________________________
>Jalview-discuss mailing list
>Jalview-discuss@jalview.org
>http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss