I was in kind of a wonky state of mind when I posted this, although it is of course somewhat interesting.
My energy quip may have been out of scale here. The snippet below describes passing roughly 3.9 million images through a single high-end GPU (10,335 training images × ~375 total epochs across all the stages; each epoch is one pass over the entire dataset). The most powerful models, by contrast, are trained on huge clusters of the highest-end GPUs.

> We split the dataset into a training set and a testing set. The training set contains 10,335 images and the testing set contains 1,149 images. We downsample the images to 512 × 256 resolution. The texture attribute labels are the combinations of clothes colors and fabrics annotations. The modules in the whole pipeline are trained stage by stage. All of our models are trained on one NVIDIA Tesla V100 GPU. We adopt the Adam optimizer. The learning rate is set as 1 × 10^-4.
>
> For the training of Stage I (i.e., Pose to Parsing), we use the (human pose, clothes shape labels) pairs as inputs and the labeled human parsing masks as ground truths. We use the instance channel of DensePose (three-channel IUV maps in original) as the human pose P. Each shape attribute a_i is represented as one-hot embeddings. We train the Stage I module for 50 epochs. The batch size is set as 8.
>
> For the training of the hierarchical VQVAE in Stage II, we first train the top-level codebook, E_top, and decoder for 110 epochs, and then train the bottom-level codebook, E_bot, and D_bot for 60 epochs with top-level related parameters fixed. The batch size is set as 4. The sampler with mixture-of-experts in Stage II requires X_seg and X_tex. X_seg is obtained by a human parsing tokenizer, which is trained by reconstructing the human parsing maps for 20 epochs with batch size 4. X_tex is obtained by directly downsampling the texture instance maps to the same size as the codebook indices maps using nearest interpolation. The cross-entropy loss is employed for training. The sampler is trained for 90 epochs with the batch size of 4.
>
> For the feed-forward index prediction network, we use the top-level features and bottom-level codebook indices as the input and ground-truth pairs. The feed-forward index prediction network is optimized using the cross-entropy loss. The index prediction network is trained for 45 epochs and the batch size is set as 4.
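For concreteness, here's a minimal PyTorch sketch of that two-phase Stage II schedule: train the top level first, then freeze its parameters and train the bottom level. This is not the authors' code; the module contents, the loss, and the dummy data loader are all stand-ins.

```python
import torch
from torch import nn

class VQVAELevel(nn.Module):
    """Stand-in for one level of the hierarchical VQVAE (encoder, codebook, decoder)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder layers

    def forward(self, x):
        return self.net(x)

top, bottom = VQVAELevel(), VQVAELevel()
opt_top = torch.optim.Adam(top.parameters(), lr=1e-4)  # lr from the paper
opt_bot = torch.optim.Adam(bottom.parameters(), lr=1e-4)

def train(model, optimizer, loader, epochs):
    model.train()
    for _ in range(epochs):
        for x in loader:
            loss = nn.functional.mse_loss(model(x), x)  # stand-in reconstruction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

loader = [torch.randn(4, 3, 64, 32) for _ in range(2)]  # dummy batches, batch size 4

# Phase 1: top level, 110 epochs.
train(top, opt_top, loader, epochs=110)

# Phase 2: freeze all top-level parameters, then train the bottom level
# for 60 epochs. (In the real model the bottom level also consumes features
# from the frozen top level; here both phases just reconstruct x for brevity.)
top.requires_grad_(False)
top.eval()
train(bottom, opt_bot, loader, epochs=60)
```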

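And the X_tex step is just a nearest-neighbor resize, something like the following; the 32 × 16 target size is my assumption for the codebook index map, not a number from the paper.

```python
import torch
import torch.nn.functional as F

# Texture instance map at image resolution; integer labels per pixel (dummy data).
tex_map = torch.randint(0, 10, (1, 1, 512, 256)).float()

# Nearest interpolation preserves the discrete label values while downsampling
# to the (assumed) spatial size of the codebook indices map.
x_tex = F.interpolate(tex_map, size=(32, 16), mode="nearest").long()
```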