[GitHub] [incubator-mxnet] salmanmashayekh commented on issue #16230: Loading Sagemaker NTM Artifacts

GitBox Wed, 25 Sep 2019 15:27:33 -0700

salmanmashayekh commented on issue #16230: Loading Sagemaker NTM Artifacts
URL: 
https://github.com/apache/incubator-mxnet/issues/16230#issuecomment-535248289
 
 
   Thank you both @lanking520 and @ThomasDelteil!
   
   Here is an MWE:
   
   ```
   import boto3
   import numpy as np
   import s3fs
   import sagemaker
   import sagemaker.amazon.common as smac
   from sagemaker import get_execution_role
   from sagemaker.amazon.amazon_estimator import get_image_uri
   from sklearn.feature_extraction.text import CountVectorizer
   
   fs = s3fs.S3FileSystem()
   
   
   # create the document corpus
   docs = [
       "Python is an interpreted, high-level, general-purpose programming language.",
       "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
       "Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.",
       "Python is dynamically typed and garbage-collected.",
       "It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.",
       "Python is often described as a batteries included language due to its comprehensive standard library.",
       "Python was conceived in the late 1980s as a successor to the ABC language.",
       "Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.",
       "Python 3.0, released 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.",
       "Due to concern about the amount of code written for Python 2, support for Python 2.7 (the last release in the 2.x series) was extended to 2020.",
       "Language developer Guido van Rossum shouldered sole responsibility for the project until July 2018 but now shares his leadership as a member of a five-person steering council.",
       "The Python 2 language, i.e. Python 2.7.x, is sunsetting on January 1, 2020, and the Python team of volunteers will not fix security issues, or improve in other ways after that date.",
       "With the end-of-life, only Python 3.6.x and later, e.g.",
       "Python 3.8 which should be released in October 2019 (currently in beta), will be supported.",
       "Python interpreters are available for many operating systems.",
       "A global community of programmers develops and maintains CPython, an open source reference implementation.",
       "A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.",
   ]
   DOCS_NO = len(docs)
   
   # fit the vectorizer and transform the docs
   count_vectorizer = CountVectorizer(input="content", max_df=0.8, min_df=5, max_features=10)
   transformed_docs = count_vectorizer.fit_transform(docs)
   transformed_docs = transformed_docs.astype(np.float32) # transformed_docs is a 17x8 sparse matrix
   
   # s3 params
   BUCKET = "my-bucket"
   PATH = "mxnet-mwe"
   TRAIN_DIR  = f"{PATH}/train"
   AUX_DIR    = f"{PATH}/auxiliary"
   OUTPUT_DIR = f"{PATH}/output"
   
   # store the transformed docs into s3
   with fs.open(f"s3://{BUCKET}/{TRAIN_DIR}/transformed_docs.protobuf", "wb") as buf:
       smac.write_spmatrix_to_sparse_tensor(buf, transformed_docs)
       
   # store the vocab into s3
   vocab_dict = {i:w for w,i in count_vectorizer.vocabulary_.items()}
   vocab = [str(vocab_dict[i]) for i in range(len(vocab_dict))]
   with fs.open(f"s3://{BUCKET}/{AUX_DIR}/vocab.txt", "wb") as f:
       f.write(("\n".join(vocab)).encode())
   
   # get the training image, role, and session
   container = get_image_uri(boto3.Session().region_name, 'ntm')
   role = get_execution_role()
   session = sagemaker.Session()
   
   # instantiate the estimator
   ntm = sagemaker.estimator.Estimator(
       container,
       role, 
       train_instance_count = 1, 
       train_instance_type = "ml.p2.xlarge",
       output_path = f"s3://{BUCKET}/{OUTPUT_DIR}",
       sagemaker_session = session,
   )
   
   # set hyper-params
   ntm.set_hyperparameters(
       num_topics = 3,
       feature_dim = len(vocab),
   )
   
   # set training and auxiliary channels
   train_channel = sagemaker.session.s3_input(
       f"s3://{BUCKET}/{TRAIN_DIR}/transformed_docs.protobuf",
       content_type = 'application/x-recordio-protobuf'
   )
   auxiliary_channel = sagemaker.session.s3_input(
       f"s3://{BUCKET}/{AUX_DIR}/vocab.txt",
       content_type = 'text/plain'
   )
   
   # fit the model
   ntm.fit(
       {
           'train': train_channel,
           'auxiliary': auxiliary_channel,
       }
   )
   ```
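
The vocabulary step in the MWE relies on one detail: `CountVectorizer.vocabulary_` maps each term to its column index, so inverting it and walking the indices in order should yield a `vocab.txt` whose line i names feature column i. A minimal sketch of that inversion, using a hand-written stand-in dictionary (the terms and indices are illustrative, not the fitted vocabulary):

```python
# Stand-in for count_vectorizer.vocabulary_ (term -> column index).
# These entries are made up for illustration only.
vocabulary = {"python": 2, "and": 0, "language": 1, "the": 3}

# Invert the mapping so each column index points at its term.
index_to_term = {index: term for term, index in vocabulary.items()}

# Emit the terms in column order: line i of the file is column i.
vocab_lines = [index_to_term[i] for i in range(len(index_to_term))]

# Same encoding as the MWE: newline-separated terms, UTF-8 bytes.
vocab_bytes = "\n".join(vocab_lines).encode()

print(vocab_lines)
```

If the line order and the column order ever disagree, the topic words printed at the end of training would be attached to the wrong features, so this ordering is worth checking before uploading.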
   
   The training log is as follows:
   
   ```
   2019-09-25 21:56:29 Starting - Starting the training job...
   2019-09-25 21:56:35 Starting - Launching requested ML instances......
   2019-09-25 21:57:34 Starting - Preparing the instances for training......
   2019-09-25 21:58:45 Downloading - Downloading input data...
   2019-09-25 21:59:04 Training - Downloading the training image..Docker 
entrypoint called with argument(s): train
   /opt/amazon/lib/python2.7/site-packages/pandas/util/nosetester.py:13: 
DeprecationWarning: Importing from numpy.testing.nosetester is deprecated, 
import from numpy.testing instead.
     from numpy.testing import nosetester
   [09/25/2019 21:59:36 INFO 140704525621056] Reading default configuration 
from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: 
{u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': 
u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto_gpu', 
u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': 
u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'sub_sample': 
u'1.0', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', 
u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', 
u'tolerance': u'0.001', u'batch_norm': u'false'}
   [09/25/2019 21:59:36 INFO 140704525621056] Reading provided configuration 
from /opt/ml/input/config/hyperparameters.json: {u'feature_dim': u'8', 
u'num_topics': u'3'}
   [09/25/2019 21:59:36 INFO 140704525621056] Final configuration: 
{u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': 
u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto_gpu', 
u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': 
u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'sub_sample': 
u'1.0', u'epochs': u'50', u'feature_dim': u'8', u'weight_decay': u'0.0', 
u'num_topics': u'3', u'_num_kv_servers': u'auto', u'encoder_layers_activation': 
u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': 
u'false'}
   [09/25/2019 21:59:37 INFO 140704525621056] nvidia-smi took: 0.0503778457642 
secs to identify 1 gpus
   Process 1 is a worker.
   [09/25/2019 21:59:37 INFO 140704525621056] Using default worker.
   [09/25/2019 21:59:37 INFO 140704525621056] Initializing
   
/opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/config/config_helper.py:122:
 DeprecationWarning: deprecated
     warnings.warn("deprecated", DeprecationWarning)
   [09/25/2019 21:59:37 INFO 140704525621056] /opt/ml/input/data/auxiliary
   [09/25/2019 21:59:37 INFO 140704525621056] vocab.txt
   [09/25/2019 21:59:37 INFO 140704525621056] Vocab file vocab.txt is expected 
at /opt/ml/input/data/auxiliary
   [09/25/2019 21:59:37 INFO 140704525621056] Loading pre-trained token 
embedding vectors from 
/opt/amazon/lib/python2.7/site-packages/algorithm/s3_binary/glove.6B.50d.txt
   
   2019-09-25 21:59:58 Uploading - Uploading generated training model
   2019-09-25 21:59:58 Completed - Training job completed
   [09/25/2019 21:59:49 WARNING 140704525621056] 0 out of 8 in vocabulary do 
not have embeddings! Default vector used for unknown embedding!
   [09/25/2019 21:59:49 INFO 140704525621056] Vocab embedding shape
   [09/25/2019 21:59:49 INFO 140704525621056] Number of GPUs being used: 1
   [09/25/2019 21:59:53 INFO 140704525621056] Create Store: device
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": 
{"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 
1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 
0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, 
"max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 0, "sum": 
0.0, "min": 0}}, "EndTime": 1569448793.550828, "Dimensions": {"Host": "algo-1", 
"Meta": "init_train_data_iter", "Operation": "training", "Algorithm": 
"AWS/NTM"}, "StartTime": 1569448793.55079}
   
   [2019-09-25 21:59:53.551] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 0, "duration": 16503, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 1
   [2019-09-25 21:59:53.571] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 2, "duration": 19, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 1 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.13865429163
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
0.000402930745622
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.13825134933
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.13865429163
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=1, train total_loss <loss>=0.13865429163
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.03s, val: 0.00s, 
epoch: 0.03s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 2 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Total Records Seen": {"count": 
1, "max": 17, "sum": 17.0, "min": 17}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 2, "sum": 2.0, "min": 2}}, "EndTime": 1569448793.577662, "Dimensions": 
{"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", 
"Algorithm": "AWS/NTM", "epoch": 0}, "StartTime": 1569448793.551103}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=636.442222897 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 2
   [2019-09-25 21:59:53.591] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 5, "duration": 13, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 2 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.137792155147
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
0.000175397624844
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.137616753578
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.137792155147
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=2, train total_loss <loss>=0.137792155147
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.01s, val: 0.00s, 
epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 4 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 2, "sum": 2.0, "min": 2}, "Total Records Seen": {"count": 
1, "max": 34, "sum": 34.0, "min": 34}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 4, "sum": 4.0, "min": 4}}, "EndTime": 1569448793.597192, "Dimensions": 
{"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", 
"Algorithm": "AWS/NTM", "epoch": 1}, "StartTime": 1569448793.577978}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=876.919088438 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 3
   [2019-09-25 21:59:53.609] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 8, "duration": 11, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 3 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.136722803116
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
5.14156054123e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.136671379209
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.136722803116
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=3, train total_loss <loss>=0.136722803116
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.01s, val: 0.00s, 
epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 6 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 3, "sum": 3.0, "min": 3}, "Total Records Seen": {"count": 
1, "max": 51, "sum": 51.0, "min": 51}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 6, "sum": 6.0, "min": 6}}, "EndTime": 1569448793.615088, "Dimensions": 
{"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", 
"Algorithm": "AWS/NTM", "epoch": 2}, "StartTime": 1569448793.597521}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=958.182731976 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 4
   [2019-09-25 21:59:53.628] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 11, "duration": 12, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 4 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.136693164706
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
4.85915334139e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.13664457202
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.136693164706
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=4, train total_loss <loss>=0.136693164706
   [09/25/2019 21:59:53 INFO 140704525621056] patience 
losses:[0.13865429162979126, 0.13779215514659882, 0.13672280311584473] min 
patience loss:0.136722803116 current loss:0.136693164706 absolute loss 
difference:2.96384096146e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved 
(enough). Bad count:1
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.01s, val: 0.00s, 
epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 8 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 4, "sum": 4.0, "min": 4}, "Total Records Seen": {"count": 
1, "max": 68, "sum": 68.0, "min": 68}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 8, "sum": 8.0, "min": 8}}, "EndTime": 1569448793.633979, "Dimensions": 
{"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", 
"Algorithm": "AWS/NTM", "epoch": 3}, "StartTime": 1569448793.615416}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=908.019866032 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 5
   [2019-09-25 21:59:53.648] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 14, "duration": 14, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 5 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.136731252074
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
3.38535865012e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.136697411537
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.136731252074
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=5, train total_loss <loss>=0.136731252074
   [09/25/2019 21:59:53 INFO 140704525621056] patience 
losses:[0.13779215514659882, 0.13672280311584473, 0.13669316470623016] min 
patience loss:0.136693164706 current loss:0.136731252074 absolute loss 
difference:3.80873680115e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved 
(enough). Bad count:2
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.02s, val: 0.00s, 
epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 10 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 5, "sum": 5.0, "min": 5}, "Total Records Seen": {"count": 
1, "max": 85, "sum": 85.0, "min": 85}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 10, "sum": 10.0, "min": 10}}, "EndTime": 1569448793.649995, 
"Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": 
"training", "Algorithm": "AWS/NTM", "epoch": 4}, "StartTime": 1569448793.634279}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=1073.22869442 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 6
   [2019-09-25 21:59:53.665] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 17, "duration": 14, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 6 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.135983735323
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
2.08785422728e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.135962858796
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.135983735323
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=6, train total_loss <loss>=0.135983735323
   [09/25/2019 21:59:53 INFO 140704525621056] patience 
losses:[0.13672280311584473, 0.13669316470623016, 0.13673125207424164] min 
patience loss:0.136693164706 current loss:0.135983735323 absolute loss 
difference:0.000709429383278
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved 
(enough). Bad count:3
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.02s, val: 0.01s, 
epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 12 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 6, "sum": 6.0, "min": 6}, "Total Records Seen": {"count": 
1, "max": 102, "sum": 102.0, "min": 102}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 12, "sum": 12.0, "min": 12}}, "EndTime": 1569448793.673135, 
"Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": 
"training", "Algorithm": "AWS/NTM", "epoch": 5}, "StartTime": 1569448793.650263}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=737.784344767 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 7
   [2019-09-25 21:59:53.691] [tensorio] [info] epoch_stats={"data_pipeline": 
"/opt/ml/input/data/train", "epoch": 20, "duration": 17, "num_examples": 1, 
"num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 7 on 17 
examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 
0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 
5.20053035871e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 
0.135544538498
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 
0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, 
epoch=7, train total_loss <loss>=0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] patience 
losses:[0.13669316470623016, 0.13673125207424164, 0.13598373532295227] min 
patience loss:0.135983735323 current loss:0.135596558452 absolute loss 
difference:0.0003871768713
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved 
(enough). Bad count:4
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epochs exceeded patience. 
Stopping training early!
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.02s, val: 0.00s, 
epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] Early stop condition met. 
Stopping training.
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, 
completed 100 % epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 
1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, 
"max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": 
{"count": 1, "max": 7, "sum": 7.0, "min": 7}, "Total Records Seen": {"count": 
1, "max": 119, "sum": 119.0, "min": 119}, "Max Records Seen Between Resets": 
{"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, 
"max": 14, "sum": 14.0, "min": 14}}, "EndTime": 1569448793.696538, 
"Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": 
"training", "Algorithm": "AWS/NTM", "epoch": 6}, "StartTime": 1569448793.673541}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, 
train throughput=733.51886181 records/second
   [09/25/2019 21:59:53 WARNING 140704525621056] wait_for_all_workers will not 
sync workers since the kv store is not running distributed
   [09/25/2019 21:59:53 INFO 140704525621056] Best model based on early 
stopping at epoch 7. Best loss: 0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] Topics from epoch:final 
(num_topics:3) [wetc 0.63, tu 0.33]:
   [09/25/2019 21:59:53 INFO 140704525621056] [0.63, 0.33] python and the for 
in is language of
   [09/25/2019 21:59:53 INFO 140704525621056] [0.63, 0.33] in and of python the 
for is language
   [09/25/2019 21:59:53 INFO 140704525621056] [0.63, 0.33] python and language 
for of the in is
   [09/25/2019 21:59:53 INFO 140704525621056] Serializing model to 
/opt/ml/model/model_algo-1
   [09/25/2019 21:59:53 INFO 140704525621056] Saved checkpoint to 
"/tmp/tmp3kswyI/state-0001.params"
   [09/25/2019 21:59:53 INFO 140704525621056] Test data is not provided.
   #metrics {"Metrics": {"totaltime": {"count": 1, "max": 16767.024993896484, 
"sum": 16767.024993896484, "min": 16767.024993896484}, "finalize.time": 
{"count": 1, "max": 12.12000846862793, "sum": 12.12000846862793, "min": 
12.12000846862793}, "initialize.time": {"count": 1, "max": 16501.816987991333, 
"sum": 16501.816987991333, "min": 16501.816987991333}, "model.serialize.time": 
{"count": 1, "max": 3.983020782470703, "sum": 3.983020782470703, "min": 
3.983020782470703}, "setuptime": {"count": 1, "max": 66.67494773864746, "sum": 
66.67494773864746, "min": 66.67494773864746}, "early_stop.time": {"count": 7, 
"max": 6.479024887084961, "sum": 24.256229400634766, "min": 
0.23102760314941406}, "update.time": {"count": 7, "max": 26.340961456298828, 
"sum": 141.98827743530273, "min": 15.537023544311523}, "epochs": {"count": 1, 
"max": 50, "sum": 50.0, "min": 50}}, "EndTime": 1569448793.714689, 
"Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": 
"AWS/NTM"}, "StartTime": 1569448777.046916}
   ```
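
The patience messages in the log can be replayed offline. The sketch below is inferred only from the log lines above (a three-epoch window, `tolerance` 0.001, `num_patience_epochs` 3) and is not the NTM container's actual implementation. In particular, resetting the counter after a good epoch is an assumption, since the log never shows one, and `replay_early_stopping` is a hypothetical helper name.

```python
def replay_early_stopping(losses, patience=3, tolerance=0.001):
    """Replay the patience check seen in the log.

    Returns (stop_epoch, bad_counts); stop_epoch is 1-based, or None
    if the stop condition is never reached.
    """
    bad_count = 0
    bad_counts = []
    # The first check happens once `patience` earlier losses exist.
    for epoch in range(patience + 1, len(losses) + 1):
        window = losses[epoch - 1 - patience:epoch - 1]
        current = losses[epoch - 1]
        # The log reports an "absolute loss difference" against the
        # minimum loss in the window, so the same quantity is used here.
        if abs(min(window) - current) < tolerance:
            bad_count += 1
        else:
            bad_count = 0  # assumption: a good epoch resets the counter
        bad_counts.append(bad_count)
        # "Bad epochs exceeded patience" appears at bad count 4 with patience 3.
        if bad_count > patience:
            return epoch, bad_counts
    return None, bad_counts


# Total training loss per epoch, copied from the log above.
logged_losses = [
    0.13865429163,
    0.137792155147,
    0.136722803116,
    0.136693164706,
    0.136731252074,
    0.135983735323,
    0.135596558452,
]

stop_epoch, bad_counts = replay_early_stopping(logged_losses)
print(stop_epoch, bad_counts)
```

Under these assumptions the replay stops at epoch 7 with bad counts 1 through 4, which matches the log. With only 17 documents, every per-epoch change stays below the default tolerance, so a smaller `tolerance` or a larger `num_patience_epochs` would be needed to train longer.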
   
   When training finishes, it produces the following zip file, which 
contains a `metadata`, a `symbol`, and a `parameters` file: 
https://drive.google.com/open?id=1TLnIrnmB1SzPqN7cgyECql1Ri74isPKG
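
For anyone inspecting the artifact locally, a quick first step is to list and extract the archive members with the standard library. This is a minimal sketch that assumes the artifact is a plain zip file. The archive name and member names are placeholders, and a stand-in archive is built in a temporary directory so the snippet runs on its own.

```python
import os
import tempfile
import zipfile

# Build a stand-in archive so the snippet is self-contained. The member
# names are placeholders; the real artifact's names may differ.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "model_artifact.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    for name in ("metadata", "symbol", "parameters"):
        zf.writestr(name, b"placeholder")

# List the members, then extract them into a folder next to the archive.
extract_dir = os.path.join(workdir, "extracted")
with zipfile.ZipFile(archive_path) as zf:
    members = sorted(zf.namelist())
    zf.extractall(extract_dir)

print(members)
```

Pointing `archive_path` at the downloaded artifact instead of the stand-in would show the actual file names to pass to a model loader.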
   
   I am trying to load the `symbol`/`parameters` files into an MXNet model so 
that I can make predictions locally (outside of SageMaker). 
   
   Could you share a code snippet that creates an MXNet model from these 
artifacts and predicts on `transformed_docs`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
