Hello everyone,

I came across a "transcript-based" VCF file, meaning that a variant can be present multiple times, with each record belonging to a different transcript. See "File 1" below for an example. I find myself in the unfortunate situation of having to intersect it with a second VCF ("File 2") and retain all records from File 1 that have the same position and REF/ALT (see "Desired output"). Long shot: is that possible?

Thanks,
Thomas

File 1

##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	TRANSCRIPT_ID=1;GENE_ID=1;	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	14370	rs6054257	G	A	29	PASS	TRANSCRIPT_ID=2;GENE_ID=1;	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	TRANSCRIPT_ID=1;GENE_ID=2;	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	17330	.	T	A	3	q10	TRANSCRIPT_ID=2;GENE_ID=2;	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3

File 2

##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	TRANSCRIPT_ID=1;GENE_ID=1;	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.

Desired output

##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	TRANSCRIPT_ID=1;GENE_ID=1;	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	14370	rs6054257	G	A	29	PASS	TRANSCRIPT_ID=2;GENE_ID=1;	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
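
The intersection described in the question (keep every File 1 record whose position and REF/ALT also occur in File 2, regardless of transcript) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not a samtools/bcftools feature: the function names `site_key`, `intersect_by_site` and `main` are hypothetical, both files are assumed to be uncompressed text, and sites are compared as exact strings on CHROM, POS, REF and ALT, so differently normalized or multi-allelic records would not match.

```python
import sys


def site_key(record):
    """Return (CHROM, POS, REF, ALT) for one VCF data line.

    Splits on any whitespace so that both tab-separated files and
    space-separated pasted examples work; the first five VCF columns
    never contain whitespace themselves.
    """
    chrom, pos, _id, ref, alt = record.split(None, 5)[:5]
    return chrom, pos, ref, alt


def intersect_by_site(file1_lines, file2_lines):
    """Keep File 1 header lines plus every File 1 record whose
    (CHROM, POS, REF, ALT) also appears in File 2."""
    # Collect the sites present in File 2 (header and blank lines skipped).
    wanted = {
        site_key(line)
        for line in file2_lines
        if line.strip() and not line.startswith("#")
    }
    kept = []
    for line in file1_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines are passed through unchanged
        elif line.strip() and site_key(line) in wanted:
            kept.append(line)  # every transcript copy of a shared site is kept
    return kept


def main(path1, path2):
    """Hypothetical command-line helper: write the intersection to stdout."""
    with open(path1) as f1, open(path2) as f2:
        sys.stdout.writelines(intersect_by_site(f1.readlines(), f2.readlines()))
```

Calling `main("file1.vcf", "file2.vcf")` (placeholder file names) would print the File 1 header followed by both transcript copies of 20:14370 G>A for the example above, since the lookup ignores the INFO column. All File 2 sites are held in memory as a set, which should be fine for moderately sized files but is an assumption worth checking for very large ones.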