This is an automated email from the ASF dual-hosted git repository.
nchung pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-sdap-in-situ-data-services.git.
from b68a24f SDAP-354: Import initial version of in situ data services
new d237005 Initial commit
new 23a0f94 feat: add WIP parquet flask server
new 52a6058 chore: some updates
new 3efd940 feat: wip: update spark to upload file to s3
new a877b29 feat: upgrade to hadoop 3.2.0 for s3 connection
new ae7c43d feat: update spark config library for aws
new 867b392 feat: update code for flask to work with cluster
new 9003fbd fix: update PythonUtils.getPythonAuthSocketTimeout error
new 6adb579 feat: add file logger
new 32ca75f feat: avoid commiting aws credentials
new 2afb10a feat: add endpoint to download file from s3 to upload to parquet
new 88bb3c6 fix: need aws variables in env + update file delete function
new 152949f feat: add query logic + refactor some code
new fbc5c70 feat: add query endpoint to flask
new 1df354b feat: add tag method + set url first before executing methods
new 38012aa feat: add year & month for partitions
new 7ac0487 feat: tag ingested s3 objects
new 1e50ed7 chore: update formate for easier readability
new 7ec98e5 feat: update schema + add schema for parquet
new 07807b0 chore: add more files to ignore
new 45014ec breaking: update ingest logic with updated schema
new 2697d99 breaking: update ingest endpoint with updated schema and logic
new b769827 fix: disable validating individual observation record coz it is too slow
new 1c92bca chore: move the logger statement outside of the loop
new 765e966 feat: add new query class which can handle more queries & pagination
new 8ad6b3c breaking: update endpoint to work with new query class
new cef3f20 fix: use pyspark to remove columns
new 21998d2 feat: allow query to choose only specific columns
new 532b3f1 feat: use append mode again + partition by job-id + remove ingested date
new d9864b2 fix: update schema and parse method to include columns
new d632b45 feat: add method & endpoint to replace file
new edf682e chore: update readme
new c32b2ad feat: update aws creds w/o token
new ec4ce11 fix: s3-download parameter is in wrong order + update to custom ports
new 6cd04f5 fix: add log stmts + fix typo
new 703440f chore: update log stmt
new a482b06 feat: check session_token is null b4 setting it
new a412dad chore: add s3 log stmts
new 3fb0c7b fix: wrong argument when calling s3 download method
new 086818f chore: more log stmt
new 4af5f92 fix: need to re-use initialized s3 class. not new one
new 19af269 fix: schema is updated. where to find data types also need to be updated
new 983289a feat: use fastjsonschema + parallel process to validate large json arrays
new 3a8f351 fix: calling s3 class twice in replace_json_s3 endpoint
new ee85c85 feat: add get endpoint + add doms compatible endpoint
new 428b09f feat: add platform_code,variable,quality_flag
new ac12689 feat: adding platform code to the columns + make it a partition
new 41f4066 feat: adding ddb logic (wip) + refactor how to receive aws cred
new e11eddd feat: finished creating ddb classes
new 5e5129b feat:add metadata to ddb tbl
new ad1d2be feat: unzip s3 file if it is zipped
new 71dc71e feat: extract ingest aws json file logic to its own class
new c124480 fix: allow get method + start_from & size needs to be int
new 48884f1 fix:add unique temporary folder + create it
new 03e585f feat: add pagination to doms response
new 3a323c8 fix: resource & client are not methods
new 53ffde1 fix: insert record in ddb needs to update logic
new 8ca9eb6 chore: update ddb class name
new aced8af fix:if expecting millisecond, convert to float first
new c37460d feat: validate ingest/replace against DDB first + add stream logger + get log_level from env
new 09c1ec3 fix: attempt to reduce query time
new 88767eb breaking: upgrade to python3.9 dependencies + junk code to try to speed up Parquet
new e09f47f fix: hardcoding total number for now
new 91451cf chore: small tweak to spark executor RAM to compare performance
new d2ab730 chore:increase more resources
new bda09cf feat: map local directory to all services in docker-compose + change Parquet storage to local
new 3f879c9 fix: adding missing platform_code
new 3034197 fix: update spark parameters for k8s spark cluster
new 123afcb Initial SwaggerUI deployment + OpenAPI spec
new b6e2084 Fix startTime/endTime examples
new 5425432 Merge pull request #1 from access-cdms/CDMS-79
new 1a565aa Merge branch 'master' of github.jpl.nasa.gov:access-cdms/in-situ-data-services
new 71f5fda feat: add month to sql filter + s3 list children method
new b9ca714 feat: add provider, project to doms api
new 06987e5 feat: validate sha512 before ingestion
new e7c75dd fix: sha512 bug + disable tagging + add more info in response
new 235b3ca chore: add more details on response
new e5bd7db fix: need to extract sha512 b4 comparing
new bdab0af feat: update code for k8s spark + instruction to setup k8s spark
new cd9a334 Added Apache 2.0 license.
new d5fb0c6 Merge branch 'master' of github.jpl.nasa.gov:access-cdms/in-situ-data-services
new 4478192 feat: remove brackets in doms get parameters for bbox
new 5e3b2dc fix: replace json array with comma separated str for normal query as well + update descriptions
new 4970959 chore: merge from forked_apache master
new 52b47e2 fix: relace big jar with text file
new d60d304 chore: remove old hadoop libraries
new 5cf4c86 chore: update ignore file to remove aws library + removed aws jar from git history
new 7bf15ea chore: merged from origin master
new 321a05f fix: more options when connecting to spark
new 9ce9573 chore: use class variable to avoid typo
new 6db2449 fix: add spark.driver.host to talk to k8s sparl
new 60279ca feat: add simple k8s for parquet
new 1c43877 chore: update configmap value + move values.yaml to another location
new f8a7b73 chore: use docker.io image tag
new 00954d2 feat: add jupyter notebook for demo
new 506ddda fix: add sample response
new 2ea30e7 chore: add more details
new 04fa46d feat: allow default boto3 session for iam base roles
new 9a27f94 chore: update file for pep-8
new 7167ab2 feat: prep for spark3.2.0
new d5f93c9 fix: allow aws token from secret file
new f379e8b feat: add missing depth value condition
new 1a092da chore: add raw query for debugging purpose
new 21b5e8f fix: allow spark config come from env file
new 29daaea feat: add extra spark setting
new 646a696 fix: validate NULL before type checking
new 1934c81 fix: update spark aws cred logic
new 9e89d68 fix: add missing depth condition with "OR" statement
new dcf09c9 chore: update values.yaml with EKS values
new ff244b6 chore: update readme
new 6342e5e chore: add helm scripts (in-progress)
new 215589f fix: spark_config_dict needs a dict + aws creds directly from values.yaml now
new 22a3380 chore: saving progress
new ba849e5 fix: add condition to set empty secret or real one
new 812aa7d chore: add readme + update git ignore
new 800e3c4 fix: unable to query the service, only the pod before this fix
new 6afee71 feat: add auth header to ingest new files
new 11c532e fix: typos + update docker with file base auth for now
new 7756a99 feat: add comma separated variable and columns to the query parquet logic
new 5d06697 feat: add column
new 375379e chore: update docker tag
new 2330ff4 fix: update docker-compose dockerfile
new 7362eb2 fix: ddb name comes from setting
new 5f0f6f9 chore: add deployment guide
new c9d8ca3 chore: move docker files to docker directory
new 5f412f6 chore: move jupyter notebook to documentaiton directory
new af71dba feat: add aws lambda code to ingest data from S3 to parquet (not.tested)
new 1a7c5d4 chore: move flask server starting script to the module
new e1ef03f feat: add terraform (in.progress)
new ca2d1ef feat: add bench_mark tests
new 0d2c219 feat: add new type of bench_mark
new 6131ebc feat: add addition key to run ingest in background
new bc9090b fix: accepting s3_url from event for now
new e08faa5 fix: update ingest lambda header
new 67b5f1e fix: compute sha512 from original s3 file + bug on retrieving optional parameter in ingest endpoints
new 139754b chore:adding documentation for performance issue
new 66c7931 chore: update rebuilding of images by shuffling the stacks
new bc0a17b fix: repartition to reduce number of ingested files in parquet + overwrite vs. append for replace vs. insert
new 1bf8f4c chore: update time filter
new 98f8844 feat: add partition to the parquet path to increase speed (#44)
new 1530115 fix: return correct size for last page (need count to do it)
new f27cf93 chore: add test result
new 6580a34 chore: remove comment + update test
new 672e41a feat: update swagger with latest changes
new 9788e7f fix: allow URL ending with `/` also works
new 29733d8 fix: disable redirect if `/` is not in the URL
new 496a5f3 fix: rename apidocs to a unique name to avoid weird bug pointing to sdap swagger
new c78cb4b feat: remove old domains + add provider & project
new f02ddff feat: add platform code
new d459341 feat: multiple dataframe read & union all + multiple selective month (#47)
new 190af10 chore: update benchmark results
new 9e36b5d feat: accept multiple platform values separated by comma (#54)
new 76bbc48 fix: update bug in dataframe union
new 4e65e0e Merge pull request #1 from wphyojpl/master
The 155 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
.gitignore | 8 +-
Deployment-in-AWS.md | 117 +++
docker/parquet.spark.3.1.2.r70.Dockerfile | 19 +
docker/parquet.spark.3.2.0.r44.Dockerfile | 21 +
.../jupyter.notebooks/cdms.demo.2021.12.09.ipynb | 1089 ++++++++++++++++++++
.../in-situ-architecture-demo.png | Bin 0 -> 186934 bytes
.../jupyter.notebooks/in-situ-architecture.png | Bin 0 -> 170538 bytes
.../in-situ-parquet-partition.png | Bin 0 -> 45606 bytes
.../jupyter.notebooks/in-situ-s3-input-data.png | Bin 0 -> 439347 bytes
k8s_spark/README.md | 78 ++
k8s_spark/configmap.yml | 10 +
k8s_spark/flask-deployment.yml | 72 ++
k8s_spark/flask-service.yml | 15 +
k8s_spark/k8s_spark/values.yaml | 694 +++++++++++++
k8s_spark/parquet.spark.helm/.helmignore | 23 +
k8s_spark/parquet.spark.helm/Chart.yaml | 24 +
k8s_spark/parquet.spark.helm/README.md | 14 +
k8s_spark/parquet.spark.helm/templates/NOTES.txt | 22 +
.../parquet.spark.helm/templates/_helpers.tpl | 62 ++
.../parquet.spark.helm/templates/deployment.yaml | 97 ++
k8s_spark/parquet.spark.helm/templates/hpa.yaml | 28 +
.../parquet.spark.helm/templates/ingress.yaml | 41 +
k8s_spark/parquet.spark.helm/templates/secret.yaml | 9 +
.../parquet.spark.helm/templates/service.yaml | 16 +
.../templates/serviceaccount.yaml | 12 +
.../templates/tests/test-connection.yaml | 15 +
k8s_spark/parquet.spark.helm/values.yaml | 92 ++
local.spark.cluster/README.md | 17 +
local.spark.cluster/aws-java-sdk-1.7.4.jar | Bin 0 -> 11948376 bytes
.../aws-java-sdk-bundle-1.11.563.jar.txt | 1 +
local.spark.cluster/build.sh | 33 +
local.spark.cluster/cluster-base.Dockerfile | 19 +
local.spark.cluster/docker-compose.yml | 82 ++
local.spark.cluster/hadoop-aws-3.2.0.jar | Bin 0 -> 480674 bytes
local.spark.cluster/jupyterlab.Dockerfile | 16 +
local.spark.cluster/parquet-flask.Dockerfile | 17 +
local.spark.cluster/spark-base.Dockerfile | 36 +
local.spark.cluster/spark-defaults.conf | 52 +
local.spark.cluster/spark-master.Dockerfile | 8 +
local.spark.cluster/spark-worker.Dockerfile | 8 +
flask_server.py => parquet_flask/__main__.py | 0
parquet_flask/authenticator/__init__.py | 0
.../authenticator/authenticator_abstract.py | 12 +
.../authenticator_aws_secret_manager.py | 37 +
.../authenticator/authenticator_factory.py | 18 +
.../authenticator/authenticator_filebased.py | 33 +
.../authenticator/authenticator_pass_through.py | 11 +
parquet_flask/aws/aws_cred.py | 38 +-
parquet_flask/aws/aws_ddb.py | 4 +-
parquet_flask/aws/aws_s3.py | 31 +-
parquet_flask/aws/aws_secret_manager.py | 46 +
parquet_flask/cdms_lambda_func/__init__.py | 0
.../cdms_lambda_func/ingest_s3_to_cdms/__init__.py | 0
.../ingest_s3_to_cdms/execute_lambda.py | 6 +
.../ingest_s3_to_cdms/ingest_s3_to_cdms.py | 50 +
parquet_flask/cdms_lambda_func/lambda_func_env.py | 5 +
parquet_flask/io_logic/cdms_constants.py | 4 +
parquet_flask/io_logic/cdms_schema.py | 42 +
parquet_flask/io_logic/ingest_new_file.py | 39 +-
parquet_flask/io_logic/metadata_tbl_io.py | 3 +-
.../parquet_query_condition_management_v3.py | 232 +++++
parquet_flask/io_logic/partitioned_parquet_path.py | 130 +++
parquet_flask/io_logic/query.py | 157 ---
parquet_flask/io_logic/query_v2.py | 146 +--
parquet_flask/io_logic/query_v4.py | 136 +++
parquet_flask/io_logic/raw_query.py | 127 +++
parquet_flask/io_logic/retrieve_spark_session.py | 85 +-
parquet_flask/io_logic/spark_constants.py | 8 +
parquet_flask/utils/config.py | 35 +-
parquet_flask/utils/general_utils.py | 35 +
parquet_flask/v1/__init__.py | 3 +-
parquet_flask/v1/authenticator_decorator.py | 23 +
parquet_flask/v1/ingest_aws_json.py | 156 ++-
parquet_flask/v1/ingest_json_s3.py | 8 +
.../v1/{apidocs.py => insitu_query_swagger.py} | 14 +-
.../{apidocs => insitu_query_swagger}/index.html | 0
.../insitu-spec-0.0.1.yml | 190 ++--
parquet_flask/v1/query_data.py | 44 +-
parquet_flask/v1/query_data_doms.py | 26 +-
parquet_flask/v1/replace_json_s3.py | 8 +
s3a.parquet.performance.issue.md | 121 +++
setup.py | 1 +
terraform/cdms-parquet-tf/ddb.tf | 38 +
terraform/cdms-parquet-tf/eks.tf | 0
terraform/cdms-parquet-tf/lambda.tf | 0
terraform/cdms-parquet-tf/main.tf | 14 +
terraform/cdms-parquet-tf/s3.tf | 36 +
terraform/cdms-parquet-tf/variables.tf | 19 +
terraform/cmd-paruqet.tf | 22 +
terraform/main.tf | 4 +
terraform/variables.tf | 19 +
tests/__init__.py | 0
tests/bench_mark/__init__.py | 0
tests/bench_mark/bench_mark.py | 527 ++++++++++
tests/bench_mark/func_exec_time_decorator.py | 17 +
tests/parquet_flask/__init__.py | 0
tests/parquet_flask/io_logic/__init__.py | 0
.../test_parquet_query_condition_management_v3.py | 373 +++++++
.../io_logic/test_partitioned_parquet_path.py | 14 +
tests/parquet_flask/utils/__init__.py | 0
tests/parquet_flask/utils/test_general_utils.py | 26 +
101 files changed, 5541 insertions(+), 499 deletions(-)
create mode 100644 Deployment-in-AWS.md
create mode 100644 docker/parquet.spark.3.1.2.r70.Dockerfile
create mode 100644 docker/parquet.spark.3.2.0.r44.Dockerfile
create mode 100644 documentations/jupyter.notebooks/cdms.demo.2021.12.09.ipynb
create mode 100644 documentations/jupyter.notebooks/in-situ-architecture-demo.png
create mode 100644 documentations/jupyter.notebooks/in-situ-architecture.png
create mode 100644 documentations/jupyter.notebooks/in-situ-parquet-partition.png
create mode 100644 documentations/jupyter.notebooks/in-situ-s3-input-data.png
create mode 100644 k8s_spark/README.md
create mode 100644 k8s_spark/configmap.yml
create mode 100644 k8s_spark/flask-deployment.yml
create mode 100644 k8s_spark/flask-service.yml
create mode 100644 k8s_spark/k8s_spark/values.yaml
create mode 100644 k8s_spark/parquet.spark.helm/.helmignore
create mode 100644 k8s_spark/parquet.spark.helm/Chart.yaml
create mode 100644 k8s_spark/parquet.spark.helm/README.md
create mode 100644 k8s_spark/parquet.spark.helm/templates/NOTES.txt
create mode 100644 k8s_spark/parquet.spark.helm/templates/_helpers.tpl
create mode 100644 k8s_spark/parquet.spark.helm/templates/deployment.yaml
create mode 100644 k8s_spark/parquet.spark.helm/templates/hpa.yaml
create mode 100644 k8s_spark/parquet.spark.helm/templates/ingress.yaml
create mode 100644 k8s_spark/parquet.spark.helm/templates/secret.yaml
create mode 100644 k8s_spark/parquet.spark.helm/templates/service.yaml
create mode 100644 k8s_spark/parquet.spark.helm/templates/serviceaccount.yaml
create mode 100644 k8s_spark/parquet.spark.helm/templates/tests/test-connection.yaml
create mode 100644 k8s_spark/parquet.spark.helm/values.yaml
create mode 100644 local.spark.cluster/README.md
create mode 100644 local.spark.cluster/aws-java-sdk-1.7.4.jar
create mode 100644 local.spark.cluster/aws-java-sdk-bundle-1.11.563.jar.txt
create mode 100755 local.spark.cluster/build.sh
create mode 100644 local.spark.cluster/cluster-base.Dockerfile
create mode 100644 local.spark.cluster/docker-compose.yml
create mode 100644 local.spark.cluster/hadoop-aws-3.2.0.jar
create mode 100644 local.spark.cluster/jupyterlab.Dockerfile
create mode 100644 local.spark.cluster/parquet-flask.Dockerfile
create mode 100644 local.spark.cluster/spark-base.Dockerfile
create mode 100644 local.spark.cluster/spark-defaults.conf
create mode 100644 local.spark.cluster/spark-master.Dockerfile
create mode 100644 local.spark.cluster/spark-worker.Dockerfile
rename flask_server.py => parquet_flask/__main__.py (100%)
create mode 100644 parquet_flask/authenticator/__init__.py
create mode 100644 parquet_flask/authenticator/authenticator_abstract.py
create mode 100644 parquet_flask/authenticator/authenticator_aws_secret_manager.py
create mode 100644 parquet_flask/authenticator/authenticator_factory.py
create mode 100644 parquet_flask/authenticator/authenticator_filebased.py
create mode 100644 parquet_flask/authenticator/authenticator_pass_through.py
create mode 100644 parquet_flask/aws/aws_secret_manager.py
create mode 100644 parquet_flask/cdms_lambda_func/__init__.py
create mode 100644 parquet_flask/cdms_lambda_func/ingest_s3_to_cdms/__init__.py
create mode 100644 parquet_flask/cdms_lambda_func/ingest_s3_to_cdms/execute_lambda.py
create mode 100644 parquet_flask/cdms_lambda_func/ingest_s3_to_cdms/ingest_s3_to_cdms.py
create mode 100644 parquet_flask/cdms_lambda_func/lambda_func_env.py
create mode 100644 parquet_flask/io_logic/cdms_schema.py
create mode 100644 parquet_flask/io_logic/parquet_query_condition_management_v3.py
create mode 100644 parquet_flask/io_logic/partitioned_parquet_path.py
delete mode 100644 parquet_flask/io_logic/query.py
create mode 100644 parquet_flask/io_logic/query_v4.py
create mode 100644 parquet_flask/io_logic/raw_query.py
create mode 100644 parquet_flask/io_logic/spark_constants.py
create mode 100644 parquet_flask/v1/authenticator_decorator.py
rename parquet_flask/v1/{apidocs.py => insitu_query_swagger.py} (73%)
rename parquet_flask/v1/{apidocs => insitu_query_swagger}/index.html (100%)
rename parquet_flask/v1/{apidocs => insitu_query_swagger}/insitu-spec-0.0.1.yml (71%)
create mode 100644 s3a.parquet.performance.issue.md
create mode 100644 terraform/cdms-parquet-tf/ddb.tf
create mode 100644 terraform/cdms-parquet-tf/eks.tf
create mode 100644 terraform/cdms-parquet-tf/lambda.tf
create mode 100644 terraform/cdms-parquet-tf/main.tf
create mode 100644 terraform/cdms-parquet-tf/s3.tf
create mode 100644 terraform/cdms-parquet-tf/variables.tf
create mode 100644 terraform/cmd-paruqet.tf
create mode 100644 terraform/main.tf
create mode 100644 terraform/variables.tf
create mode 100644 tests/__init__.py
create mode 100644 tests/bench_mark/__init__.py
create mode 100644 tests/bench_mark/bench_mark.py
create mode 100644 tests/bench_mark/func_exec_time_decorator.py
create mode 100644 tests/parquet_flask/__init__.py
create mode 100644 tests/parquet_flask/io_logic/__init__.py
create mode 100644 tests/parquet_flask/io_logic/test_parquet_query_condition_management_v3.py
create mode 100644 tests/parquet_flask/io_logic/test_partitioned_parquet_path.py
create mode 100644 tests/parquet_flask/utils/__init__.py
create mode 100644 tests/parquet_flask/utils/test_general_utils.py