Asha Rostamianfar created BEAM-5844:
---------------------------------------
Summary: Transition VCF IO to use Nucleus
Key: BEAM-5844
URL: https://issues.apache.org/jira/browse/BEAM-5844
Project: Beam
Issue Type: Task
Components: sdk-py-core
Reporter: Asha Rostamianfar
Assignee: Asha Rostamianfar
Currently, vcfio.py uses [PyVCF|https://github.com/jamescasbon/PyVCF] as its
parser. Although it is one of the more popular VCF parsers, it is not actively
maintained. There are also Python 3 compatibility issues (see BEAM-5628). There
is a new FOSS parser from the Google Brain team, called
[Nucleus|https://github.com/google/nucleus], that we can use instead. It also
has built-in protocol buffer support, so we would no longer need to transform
the parser's internal structures into Variant objects (we can deprecate the
existing Variant/VariantCall classes in favor of using the protos).
The Google Cloud Healthcare & Life Sciences team is planning to switch to
Nucleus as the parser for the [Variant
Transforms|https://github.com/googlegenomics/gcp-variant-transforms] tool. Once
that is done, we'll sync the [vcfio.py
code|https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/beam_io/vcfio.py]
back to the Beam SDK so that the wider community can use it as well
(potentially with additional features, such as ReadAllFromVCF and a VCF sink).