Hello Everyone,

I came across a "transcript-based" VCF file, meaning a variant can be
present multiple times, with each record belonging to a different
transcript. See "File 1" below for an example. I find myself in the
unfortunate situation of having to intersect it with a second file
("File 2") and retain all File 1 records with the same position and
REF/ALT ("Desired output").
Long shot: is that possible?

Thanks,
Thomas

File 1
##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20      14370   rs6054257       G       A       29      PASS    TRANSCRIPT_ID=1;GENE_ID=1;      GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      14370   rs6054257       G       A       29      PASS    TRANSCRIPT_ID=2;GENE_ID=1;      GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330   .       T       A       3       q10     TRANSCRIPT_ID=1;GENE_ID=2;      GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20      17330   .       T       A       3       q10     TRANSCRIPT_ID=2;GENE_ID=2;      GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3


File 2
##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20      14370   rs6054257       G       A       29      PASS    TRANSCRIPT_ID=1;GENE_ID=1;      GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.

Desired output
##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20      14370   rs6054257       G       A       29      PASS    TRANSCRIPT_ID=1;GENE_ID=1;      GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      14370   rs6054257       G       A       29      PASS    TRANSCRIPT_ID=2;GENE_ID=1;      GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
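
To make the matching rule concrete, here is a minimal Python sketch of
the behaviour I am after. It is only an illustration, not an existing
samtools/bcftools feature: the function names variant_key and
intersect_by_variant are made up, and it assumes uncompressed VCF text
with each record on a single tab-separated line and File 2 small enough
to keep its keys in memory.

```python
def variant_key(line):
    """Return (CHROM, POS, REF, ALT) for one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], fields[1], fields[3], fields[4])


def intersect_by_variant(file1_lines, file2_lines):
    """Keep File 1 headers plus every File 1 record whose key occurs in File 2.

    Duplicate File 1 records (one per transcript) are all retained,
    because matching ignores the INFO column entirely.
    """
    # Collect the keys of all data lines in File 2.
    keys = {
        variant_key(line)
        for line in file2_lines
        if line.strip() and not line.startswith("#")
    }
    kept = []
    for line in file1_lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#") or variant_key(line) in keys:
            kept.append(line)
    return kept


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python intersect.py file1.vcf file2.vcf > out.vcf
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        sys.stdout.writelines(intersect_by_variant(f1, f2.readlines()))
```

Applied to the two files above, this would keep both 20:14370 G>A
records (TRANSCRIPT_ID=1 and TRANSCRIPT_ID=2) and drop the 20:17330
records.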


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
